2026-02-21T08:05:14.2301907Z Current runner version: '2.331.0' 2026-02-21T08:05:14.2304727Z Runner name: 'linux.rocm.gpu.gfx942.2-n2gvb-runner-2vdnn' 2026-02-21T08:05:14.2305182Z Runner group name: 'default' 2026-02-21T08:05:14.2305581Z Machine name: 'linux' 2026-02-21T08:05:14.2306790Z ##[group]GITHUB_TOKEN Permissions 2026-02-21T08:05:14.2307857Z Contents: read 2026-02-21T08:05:14.2308110Z Metadata: read 2026-02-21T08:05:14.2308345Z ##[endgroup] 2026-02-21T08:05:14.2309389Z Secret source: Actions 2026-02-21T08:05:14.2309700Z Prepare workflow directory 2026-02-21T08:05:14.2576887Z Prepare all required actions 2026-02-21T08:05:14.2596514Z Getting action download info 2026-02-21T08:05:14.5733057Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd) 2026-02-21T08:05:14.9792738Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405) 2026-02-21T08:05:15.4150774Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b) 2026-02-21T08:05:15.8568637Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909) 2026-02-21T08:05:16.5667078Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f) 2026-02-21T08:05:17.0747658Z Getting action download info 2026-02-21T08:05:17.2238014Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820) 2026-02-21T08:05:17.2240361Z ##[group] Inputs 2026-02-21T08:05:17.2240517Z runner: linux.rocm.gpu.gfx942.2 2026-02-21T08:05:17.2240646Z python-version: 3.12 2026-02-21T08:05:17.2240771Z image: rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T08:05:17.2240904Z runtime-version: rocm6.4 2026-02-21T08:05:17.2241062Z container-options: --device=/dev/kfd --device=/dev/dri 2026-02-21T08:05:17.2241204Z alias: mi325x 2026-02-21T08:05:17.2241309Z kernels: int4_gemm,flash_attention 
2026-02-21T08:05:17.2241439Z env-vars: 2026-02-21T08:05:17.2241551Z custom-args: 2026-02-21T08:05:17.2241834Z run_h100: true 2026-02-21T08:05:17.2241927Z run_b200: true 2026-02-21T08:05:17.2242019Z run_mi325x: true 2026-02-21T08:05:17.2242119Z ##[endgroup] 2026-02-21T08:05:17.2242351Z Complete job name: run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x 2026-02-21T08:05:17.2450631Z ##[group]Checking docker version 2026-02-21T08:05:17.2457267Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}' 2026-02-21T08:05:17.2605496Z '1.51' 2026-02-21T08:05:17.2620210Z Docker daemon API version: '1.51' 2026-02-21T08:05:17.2620516Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}' 2026-02-21T08:05:17.2748376Z '1.51' 2026-02-21T08:05:17.2761616Z Docker client API version: '1.51' 2026-02-21T08:05:17.2764332Z ##[endgroup] 2026-02-21T08:05:17.2765591Z ##[group]Clean up resources from previous jobs 2026-02-21T08:05:17.2767747Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=ec4e10" 2026-02-21T08:05:17.2882356Z ##[command]/usr/bin/docker network prune --force --filter "label=ec4e10" 2026-02-21T08:05:17.3009538Z ##[endgroup] 2026-02-21T08:05:17.3009668Z ##[group]Create local container network 2026-02-21T08:05:17.3014434Z ##[command]/usr/bin/docker network create --label ec4e10 github_network_2fc8484199d94410beae10a38ccd998e 2026-02-21T08:05:17.3252137Z 57e0e7788a2aef9c834902a24870ebc4a4e8aa9a49e24de76cd8ade1d6b56e3e 2026-02-21T08:05:17.3266011Z ##[endgroup] 2026-02-21T08:05:17.3286861Z ##[group]Starting job container 2026-02-21T08:05:17.3298277Z ##[command]/usr/bin/docker pull rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T08:05:19.4539921Z 6.4.4-complete: Pulling from rocm/dev-ubuntu-24.04 2026-02-21T08:05:19.4540231Z 953cdd413371: Pulling fs layer 2026-02-21T08:05:19.4540345Z 3a9c27801271: Pulling fs layer 2026-02-21T08:05:19.4540466Z 4c8a4cb43e3b: Pulling fs layer 
2026-02-21T08:05:19.4540662Z 624e685c2697: Pulling fs layer 2026-02-21T08:05:19.4540767Z 624e685c2697: Waiting 2026-02-21T08:05:20.2574465Z 3a9c27801271: Verifying Checksum 2026-02-21T08:05:20.2574735Z 3a9c27801271: Download complete 2026-02-21T08:05:20.8122336Z 624e685c2697: Verifying Checksum 2026-02-21T08:05:20.8122726Z 624e685c2697: Download complete 2026-02-21T08:05:23.0098030Z 953cdd413371: Verifying Checksum 2026-02-21T08:05:23.0098569Z 953cdd413371: Download complete 2026-02-21T08:05:23.6783803Z 953cdd413371: Pull complete 2026-02-21T08:05:23.8039571Z 3a9c27801271: Pull complete 2026-02-21T08:05:46.1801431Z 4c8a4cb43e3b: Download complete 2026-02-21T08:06:25.9784575Z 4c8a4cb43e3b: Pull complete 2026-02-21T08:06:26.1113282Z 624e685c2697: Pull complete 2026-02-21T08:06:26.1133090Z Digest: sha256:31418ac10a3769a71eaef330c07280d1d999d7074621339b8f93c484c35f6078 2026-02-21T08:06:26.1136636Z Status: Downloaded newer image for rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T08:06:26.1145836Z docker.io/rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T08:06:26.1206265Z ##[command]/usr/bin/docker create --name c6efec98bce142318ef5a57c93d78ef5_rocmdevubuntu2404644complete_87ce82 --label ec4e10 --workdir /__w/helion/helion --network github_network_2fc8484199d94410beae10a38ccd998e --device=/dev/kfd --device=/dev/dri -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/_work":"/__w" -v "/home/runner/externals":"/__e":ro -v "/home/runner/_work/_temp":"/__w/_temp" -v "/home/runner/_work/_actions":"/__w/_actions" -v "/home/runner/_work/_tool":"/__w/_tool" -v "/home/runner/_work/_temp/_github_home":"/github/home" -v "/home/runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" rocm/dev-ubuntu-24.04:6.4.4-complete "-f" "/dev/null" 2026-02-21T08:06:26.3422630Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 2026-02-21T08:06:26.3446048Z ##[command]/usr/bin/docker start 
9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 2026-02-21T08:06:26.5085777Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 2026-02-21T08:06:26.5100724Z ##[command]/usr/bin/docker ps --all --filter id=9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}" 2026-02-21T08:06:26.5207144Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 Up Less than a second 2026-02-21T08:06:26.5219961Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 2026-02-21T08:06:26.5322961Z HOME=/github/home 2026-02-21T08:06:26.5323078Z GITHUB_ACTIONS=true 2026-02-21T08:06:26.5323252Z CI=true 2026-02-21T08:06:26.5323482Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:26.5341500Z ##[endgroup] 2026-02-21T08:06:26.5349558Z ##[group]Waiting for all services to be ready 2026-02-21T08:06:26.5350694Z ##[endgroup] 2026-02-21T08:06:26.5454800Z ##[group]Run echo "Detected ROCm image" 2026-02-21T08:06:26.5454988Z echo "Detected ROCm image" 2026-02-21T08:06:26.5455150Z rocminfo || echo "rocminfo not found" 2026-02-21T08:06:26.5456699Z shell: bash -l {0} 2026-02-21T08:06:26.5456871Z env: 2026-02-21T08:06:26.5456957Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:26.5457068Z ##[endgroup] 2026-02-21T08:06:26.5901830Z Detected ROCm image 2026-02-21T08:06:26.7771283Z ROCk module version 6.12.12 is loaded 2026-02-21T08:06:26.7772328Z ===================== 2026-02-21T08:06:26.7772633Z HSA System Attributes 2026-02-21T08:06:26.7772932Z ===================== 2026-02-21T08:06:26.7773219Z Runtime Version: 1.15 2026-02-21T08:06:26.7773525Z Runtime Ext Version: 1.7 2026-02-21T08:06:26.7773881Z System Timestamp Freq.: 1000.000000MHz 2026-02-21T08:06:26.7774418Z Sig. 
Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) 2026-02-21T08:06:26.7775109Z Machine Model: LARGE 2026-02-21T08:06:26.7775573Z System Endianness: LITTLE 2026-02-21T08:06:26.7776585Z Mwaitx: DISABLED 2026-02-21T08:06:26.7776889Z XNACK enabled: NO 2026-02-21T08:06:26.7777213Z DMAbuf Support: YES 2026-02-21T08:06:26.7777475Z VMM Support: YES 2026-02-21T08:06:26.7777563Z 2026-02-21T08:06:26.7777623Z ========== 2026-02-21T08:06:26.7777753Z HSA Agents 2026-02-21T08:06:26.7777889Z ========== 2026-02-21T08:06:26.7778019Z ******* 2026-02-21T08:06:26.7778142Z Agent 1 2026-02-21T08:06:26.7778261Z ******* 2026-02-21T08:06:26.7778441Z Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T08:06:26.7778660Z Uuid: CPU-XX 2026-02-21T08:06:26.7778876Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T08:06:26.7779108Z Vendor Name: CPU 2026-02-21T08:06:26.7779318Z Feature: None specified 2026-02-21T08:06:26.7779572Z Profile: FULL_PROFILE 2026-02-21T08:06:26.7779787Z Float Round Mode: NEAR 2026-02-21T08:06:26.7780018Z Max Queue Number: 0(0x0) 2026-02-21T08:06:26.7780235Z Queue Min Size: 0(0x0) 2026-02-21T08:06:26.7780447Z Queue Max Size: 0(0x0) 2026-02-21T08:06:26.7780653Z Queue Type: MULTI 2026-02-21T08:06:26.7780860Z Node: 0 2026-02-21T08:06:26.7781063Z Device Type: CPU 2026-02-21T08:06:26.7781505Z Cache Info: 2026-02-21T08:06:26.7781705Z L1: 49152(0xc000) KB 2026-02-21T08:06:26.7781910Z Chip ID: 0(0x0) 2026-02-21T08:06:26.7782128Z ASIC Revision: 0(0x0) 2026-02-21T08:06:26.7782342Z Cacheline Size: 64(0x40) 2026-02-21T08:06:26.7782558Z Max Clock Freq. (MHz): 3300 2026-02-21T08:06:26.7782763Z BDFID: 0 2026-02-21T08:06:26.7782975Z Internal Node ID: 0 2026-02-21T08:06:26.7783199Z Compute Unit: 128 2026-02-21T08:06:26.7783413Z SIMDs per CU: 0 2026-02-21T08:06:26.7783622Z Shader Engines: 0 2026-02-21T08:06:26.7783835Z Shader Arrs. per Eng.: 0 2026-02-21T08:06:26.7784292Z WatchPts on Addr. 
Ranges:1 2026-02-21T08:06:26.7784483Z Memory Properties: 2026-02-21T08:06:26.7784620Z Features: None 2026-02-21T08:06:26.7784776Z Pool Info: 2026-02-21T08:06:26.7784913Z Pool 1 2026-02-21T08:06:26.7785102Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7785319Z Size: 1584690840(0x5e747698) KB 2026-02-21T08:06:26.7785538Z Allocatable: TRUE 2026-02-21T08:06:26.7785765Z Alloc Granule: 4KB 2026-02-21T08:06:26.7786004Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7786295Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7786526Z Accessible by all: TRUE 2026-02-21T08:06:26.7786705Z Pool 2 2026-02-21T08:06:26.7786902Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7787129Z Size: 1584690840(0x5e747698) KB 2026-02-21T08:06:26.7787435Z Allocatable: TRUE 2026-02-21T08:06:26.7787640Z Alloc Granule: 4KB 2026-02-21T08:06:26.7787819Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7788008Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7788187Z Accessible by all: TRUE 2026-02-21T08:06:26.7788363Z Pool 3 2026-02-21T08:06:26.7788511Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2026-02-21T08:06:26.7788680Z Size: 1584690840(0x5e747698) KB 2026-02-21T08:06:26.7788849Z Allocatable: TRUE 2026-02-21T08:06:26.7789026Z Alloc Granule: 4KB 2026-02-21T08:06:26.7789228Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7789425Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7789607Z Accessible by all: TRUE 2026-02-21T08:06:26.7789760Z Pool 4 2026-02-21T08:06:26.7789918Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7790086Z Size: 1584690840(0x5e747698) KB 2026-02-21T08:06:26.7790252Z Allocatable: TRUE 2026-02-21T08:06:26.7790432Z Alloc Granule: 4KB 2026-02-21T08:06:26.7790611Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7790799Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7790988Z Accessible by all: TRUE 2026-02-21T08:06:26.7791134Z ISA Info: 2026-02-21T08:06:26.7791249Z ******* 2026-02-21T08:06:26.7791357Z Agent 2 2026-02-21T08:06:26.7791465Z ******* 
2026-02-21T08:06:26.7791602Z Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T08:06:26.7791775Z Uuid: CPU-XX 2026-02-21T08:06:26.7791950Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T08:06:26.7792133Z Vendor Name: CPU 2026-02-21T08:06:26.7792307Z Feature: None specified 2026-02-21T08:06:26.7792481Z Profile: FULL_PROFILE 2026-02-21T08:06:26.7792658Z Float Round Mode: NEAR 2026-02-21T08:06:26.7792834Z Max Queue Number: 0(0x0) 2026-02-21T08:06:26.7793057Z Queue Min Size: 0(0x0) 2026-02-21T08:06:26.7793227Z Queue Max Size: 0(0x0) 2026-02-21T08:06:26.7793402Z Queue Type: MULTI 2026-02-21T08:06:26.7793566Z Node: 1 2026-02-21T08:06:26.7793738Z Device Type: CPU 2026-02-21T08:06:26.7793885Z Cache Info: 2026-02-21T08:06:26.7794024Z L1: 49152(0xc000) KB 2026-02-21T08:06:26.7794185Z Chip ID: 0(0x0) 2026-02-21T08:06:26.7794353Z ASIC Revision: 0(0x0) 2026-02-21T08:06:26.7794527Z Cacheline Size: 64(0x40) 2026-02-21T08:06:26.7794703Z Max Clock Freq. (MHz): 3300 2026-02-21T08:06:26.7794873Z BDFID: 0 2026-02-21T08:06:26.7795045Z Internal Node ID: 1 2026-02-21T08:06:26.7795252Z Compute Unit: 128 2026-02-21T08:06:26.7795420Z SIMDs per CU: 0 2026-02-21T08:06:26.7795593Z Shader Engines: 0 2026-02-21T08:06:26.7795769Z Shader Arrs. per Eng.: 0 2026-02-21T08:06:26.7795954Z WatchPts on Addr. 
Ranges:1 2026-02-21T08:06:26.7796107Z Memory Properties: 2026-02-21T08:06:26.7796223Z Features: None 2026-02-21T08:06:26.7796336Z Pool Info: 2026-02-21T08:06:26.7796448Z Pool 1 2026-02-21T08:06:26.7796597Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7796775Z Size: 1581097388(0x5e3da1ac) KB 2026-02-21T08:06:26.7796958Z Allocatable: TRUE 2026-02-21T08:06:26.7797140Z Alloc Granule: 4KB 2026-02-21T08:06:26.7797370Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7797548Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7797724Z Accessible by all: TRUE 2026-02-21T08:06:26.7797871Z Pool 2 2026-02-21T08:06:26.7798017Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7798191Z Size: 1581097388(0x5e3da1ac) KB 2026-02-21T08:06:26.7798356Z Allocatable: TRUE 2026-02-21T08:06:26.7798544Z Alloc Granule: 4KB 2026-02-21T08:06:26.7798721Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7798922Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7799094Z Accessible by all: TRUE 2026-02-21T08:06:26.7799270Z Pool 3 2026-02-21T08:06:26.7799436Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2026-02-21T08:06:26.7799604Z Size: 1581097388(0x5e3da1ac) KB 2026-02-21T08:06:26.7799771Z Allocatable: TRUE 2026-02-21T08:06:26.7799940Z Alloc Granule: 4KB 2026-02-21T08:06:26.7800128Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7800306Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7800512Z Accessible by all: TRUE 2026-02-21T08:06:26.7800659Z Pool 4 2026-02-21T08:06:26.7800849Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7801042Z Size: 1581097388(0x5e3da1ac) KB 2026-02-21T08:06:26.7801209Z Allocatable: TRUE 2026-02-21T08:06:26.7801380Z Alloc Granule: 4KB 2026-02-21T08:06:26.7801557Z Alloc Recommended Granule:4KB 2026-02-21T08:06:26.7801738Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7801916Z Accessible by all: TRUE 2026-02-21T08:06:26.7802058Z ISA Info: 2026-02-21T08:06:26.7802164Z ******* 2026-02-21T08:06:26.7802267Z Agent 3 2026-02-21T08:06:26.7802369Z ******* 
2026-02-21T08:06:26.7802495Z Name: gfx942 2026-02-21T08:06:26.7802771Z Uuid: GPU-eba66cc485172b60 2026-02-21T08:06:26.7802941Z Marketing Name: AMD Instinct MI325X 2026-02-21T08:06:26.7803120Z Vendor Name: AMD 2026-02-21T08:06:26.7803355Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7803524Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7803688Z Float Round Mode: NEAR 2026-02-21T08:06:26.7803856Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7804018Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7804180Z Queue Max Size: 131072(0x20000) 2026-02-21T08:06:26.7804340Z Queue Type: MULTI 2026-02-21T08:06:26.7804498Z Node: 2 2026-02-21T08:06:26.7804658Z Device Type: GPU 2026-02-21T08:06:26.7804796Z Cache Info: 2026-02-21T08:06:26.7804923Z L1: 32(0x20) KB 2026-02-21T08:06:26.7805072Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7805219Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7805365Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7805525Z ASIC Revision: 1(0x1) 2026-02-21T08:06:26.7805687Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7805856Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7806016Z BDFID: 29952 2026-02-21T08:06:26.7806173Z Internal Node ID: 2 2026-02-21T08:06:26.7806338Z Compute Unit: 304 2026-02-21T08:06:26.7806502Z SIMDs per CU: 4 2026-02-21T08:06:26.7806665Z Shader Engines: 32 2026-02-21T08:06:26.7806833Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7807006Z WatchPts on Addr. 
Ranges:4 2026-02-21T08:06:26.7807178Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7807322Z Memory Properties: 2026-02-21T08:06:26.7807443Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7807594Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7807763Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7807929Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7808079Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7808216Z x 1024(0x400) 2026-02-21T08:06:26.7808411Z y 1024(0x400) 2026-02-21T08:06:26.7808565Z z 1024(0x400) 2026-02-21T08:06:26.7808756Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7808926Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7809114Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7809272Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7809397Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7809544Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7809697Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7809879Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7820148Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7820536Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7820720Z IOMMU Support:: None 2026-02-21T08:06:26.7820866Z Pool Info: 2026-02-21T08:06:26.7821074Z Pool 1 2026-02-21T08:06:26.7821222Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7821392Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7821560Z Allocatable: TRUE 2026-02-21T08:06:26.7821758Z Alloc Granule: 4KB 2026-02-21T08:06:26.7821938Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7822119Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7822293Z Accessible by all: FALSE 2026-02-21T08:06:26.7822437Z Pool 2 2026-02-21T08:06:26.7822582Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7822752Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7822930Z Allocatable: TRUE 2026-02-21T08:06:26.7823099Z Alloc Granule: 4KB 2026-02-21T08:06:26.7823272Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7823446Z Alloc Alignment: 4KB 
2026-02-21T08:06:26.7823619Z Accessible by all: FALSE 2026-02-21T08:06:26.7823761Z Pool 3 2026-02-21T08:06:26.7823899Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7824059Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7824221Z Allocatable: TRUE 2026-02-21T08:06:26.7824388Z Alloc Granule: 4KB 2026-02-21T08:06:26.7824566Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7824747Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7824916Z Accessible by all: FALSE 2026-02-21T08:06:26.7825060Z Pool 4 2026-02-21T08:06:26.7825208Z Segment: GROUP 2026-02-21T08:06:26.7825364Z Size: 64(0x40) KB 2026-02-21T08:06:26.7825525Z Allocatable: FALSE 2026-02-21T08:06:26.7825690Z Alloc Granule: 0KB 2026-02-21T08:06:26.7825864Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7826042Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7826248Z Accessible by all: FALSE 2026-02-21T08:06:26.7826393Z ISA Info: 2026-02-21T08:06:26.7826502Z ISA 1 2026-02-21T08:06:26.7826653Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7826835Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7827012Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7827185Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7827360Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7827530Z Fast f16: TRUE 2026-02-21T08:06:26.7827697Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7827850Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7827993Z x 1024(0x400) 2026-02-21T08:06:26.7828146Z y 1024(0x400) 2026-02-21T08:06:26.7828296Z z 1024(0x400) 2026-02-21T08:06:26.7828494Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7828642Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7828775Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7828926Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7829073Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7829232Z FBarrier Max Size: 32 2026-02-21T08:06:26.7829373Z ISA 2 2026-02-21T08:06:26.7829528Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7829721Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7829896Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7830067Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7830245Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7830413Z Fast f16: TRUE 2026-02-21T08:06:26.7830592Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7830745Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7830889Z x 1024(0x400) 2026-02-21T08:06:26.7831062Z y 1024(0x400) 2026-02-21T08:06:26.7831210Z z 1024(0x400) 2026-02-21T08:06:26.7831364Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7831512Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7831675Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7831845Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7832005Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7832163Z FBarrier Max Size: 32 2026-02-21T08:06:26.7832301Z ******* 2026-02-21T08:06:26.7832401Z Agent 4 2026-02-21T08:06:26.7832510Z ******* 2026-02-21T08:06:26.7832630Z Name: gfx942 2026-02-21T08:06:26.7832796Z Uuid: GPU-c8f81c2eb8adeb06 2026-02-21T08:06:26.7832978Z Marketing Name: AMD Instinct MI325X 2026-02-21T08:06:26.7833154Z Vendor Name: AMD 2026-02-21T08:06:26.7833319Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7833548Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7833718Z Float Round Mode: NEAR 2026-02-21T08:06:26.7833883Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7834084Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7834269Z Queue Max Size: 131072(0x20000) 2026-02-21T08:06:26.7834435Z Queue Type: MULTI 2026-02-21T08:06:26.7834596Z Node: 3 2026-02-21T08:06:26.7834751Z Device Type: GPU 2026-02-21T08:06:26.7834891Z Cache Info: 2026-02-21T08:06:26.7835017Z L1: 32(0x20) KB 2026-02-21T08:06:26.7835168Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7835313Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7835467Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7835627Z ASIC Revision: 1(0x1) 
2026-02-21T08:06:26.7835841Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7836008Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7836164Z BDFID: 1280 2026-02-21T08:06:26.7836327Z Internal Node ID: 3 2026-02-21T08:06:26.7836487Z Compute Unit: 304 2026-02-21T08:06:26.7836650Z SIMDs per CU: 4 2026-02-21T08:06:26.7836813Z Shader Engines: 32 2026-02-21T08:06:26.7836977Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7837150Z WatchPts on Addr. Ranges:4 2026-02-21T08:06:26.7837318Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7837469Z Memory Properties: 2026-02-21T08:06:26.7837586Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7837740Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7837906Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7838070Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7838222Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7838353Z x 1024(0x400) 2026-02-21T08:06:26.7838497Z y 1024(0x400) 2026-02-21T08:06:26.7838635Z z 1024(0x400) 2026-02-21T08:06:26.7838787Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7838950Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7839119Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7839261Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7839384Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7839534Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7839673Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7839831Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7840006Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7840180Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7840347Z IOMMU Support:: None 2026-02-21T08:06:26.7840485Z Pool Info: 2026-02-21T08:06:26.7840592Z Pool 1 2026-02-21T08:06:26.7840731Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7853427Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7853616Z Allocatable: TRUE 2026-02-21T08:06:26.7853791Z Alloc Granule: 4KB 2026-02-21T08:06:26.7853965Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7854140Z Alloc Alignment: 
4KB 2026-02-21T08:06:26.7854357Z Accessible by all: FALSE 2026-02-21T08:06:26.7854498Z Pool 2 2026-02-21T08:06:26.7854635Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7854798Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7854968Z Allocatable: TRUE 2026-02-21T08:06:26.7855135Z Alloc Granule: 4KB 2026-02-21T08:06:26.7855326Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7855520Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7855763Z Accessible by all: FALSE 2026-02-21T08:06:26.7855905Z Pool 3 2026-02-21T08:06:26.7856039Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7856238Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7856422Z Allocatable: TRUE 2026-02-21T08:06:26.7856603Z Alloc Granule: 4KB 2026-02-21T08:06:26.7856780Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7856978Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7857158Z Accessible by all: FALSE 2026-02-21T08:06:26.7857299Z Pool 4 2026-02-21T08:06:26.7857505Z Segment: GROUP 2026-02-21T08:06:26.7857658Z Size: 64(0x40) KB 2026-02-21T08:06:26.7857864Z Allocatable: FALSE 2026-02-21T08:06:26.7858029Z Alloc Granule: 0KB 2026-02-21T08:06:26.7858735Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7858915Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7859080Z Accessible by all: FALSE 2026-02-21T08:06:26.7859222Z ISA Info: 2026-02-21T08:06:26.7859331Z ISA 1 2026-02-21T08:06:26.7859476Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7859655Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7859828Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7860001Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7860175Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7860340Z Fast f16: TRUE 2026-02-21T08:06:26.7860501Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7860653Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7860790Z x 1024(0x400) 2026-02-21T08:06:26.7860933Z y 1024(0x400) 2026-02-21T08:06:26.7861079Z z 1024(0x400) 
2026-02-21T08:06:26.7861231Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7861375Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7861551Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7861699Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7861846Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7862001Z FBarrier Max Size: 32 2026-02-21T08:06:26.7862141Z ISA 2 2026-02-21T08:06:26.7862291Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7862480Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7862650Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7862821Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7862994Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7863154Z Fast f16: TRUE 2026-02-21T08:06:26.7863319Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7863466Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7863632Z x 1024(0x400) 2026-02-21T08:06:26.7863772Z y 1024(0x400) 2026-02-21T08:06:26.7863916Z z 1024(0x400) 2026-02-21T08:06:26.7864069Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7864210Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7864339Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7864483Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7864629Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7864783Z FBarrier Max Size: 32 2026-02-21T08:06:26.7864924Z ******* 2026-02-21T08:06:26.7865026Z Agent 5 2026-02-21T08:06:26.7865125Z ******* 2026-02-21T08:06:26.7865247Z Name: gfx942 2026-02-21T08:06:26.7865406Z Uuid: GPU-8b5a3495e4b669cf 2026-02-21T08:06:26.7865570Z Marketing Name: AMD Instinct MI325X 2026-02-21T08:06:26.7865733Z Vendor Name: AMD 2026-02-21T08:06:26.7865895Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7866052Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7866216Z Float Round Mode: NEAR 2026-02-21T08:06:26.7866380Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7866539Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7866735Z Queue Max Size: 131072(0x20000) 
2026-02-21T08:06:26.7866900Z Queue Type: MULTI 2026-02-21T08:06:26.7867094Z Node: 4 2026-02-21T08:06:26.7867250Z Device Type: GPU 2026-02-21T08:06:26.7867413Z Cache Info: 2026-02-21T08:06:26.7867549Z L1: 32(0x20) KB 2026-02-21T08:06:26.7867713Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7867871Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7868017Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7868178Z ASIC Revision: 1(0x1) 2026-02-21T08:06:26.7868338Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7868541Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7868762Z BDFID: 25856 2026-02-21T08:06:26.7868918Z Internal Node ID: 4 2026-02-21T08:06:26.7869098Z Compute Unit: 304 2026-02-21T08:06:26.7869258Z SIMDs per CU: 4 2026-02-21T08:06:26.7869443Z Shader Engines: 32 2026-02-21T08:06:26.7869607Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7869775Z WatchPts on Addr. Ranges:4 2026-02-21T08:06:26.7869942Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7870090Z Memory Properties: 2026-02-21T08:06:26.7870209Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7870382Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7870551Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7870716Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7870866Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7871029Z x 1024(0x400) 2026-02-21T08:06:26.7871174Z y 1024(0x400) 2026-02-21T08:06:26.7871313Z z 1024(0x400) 2026-02-21T08:06:26.7871468Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7871634Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7871797Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7871939Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7872060Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7872206Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7872351Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7872505Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7872684Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7872853Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7873019Z IOMMU 
Support:: None 2026-02-21T08:06:26.7873155Z Pool Info: 2026-02-21T08:06:26.7873261Z Pool 1 2026-02-21T08:06:26.7873396Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7873563Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7873728Z Allocatable: TRUE 2026-02-21T08:06:26.7873895Z Alloc Granule: 4KB 2026-02-21T08:06:26.7874069Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7874242Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7874414Z Accessible by all: FALSE 2026-02-21T08:06:26.7874557Z Pool 2 2026-02-21T08:06:26.7874695Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7874859Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7875016Z Allocatable: TRUE 2026-02-21T08:06:26.7875183Z Alloc Granule: 4KB 2026-02-21T08:06:26.7875352Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7875525Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7875691Z Accessible by all: FALSE 2026-02-21T08:06:26.7875831Z Pool 3 2026-02-21T08:06:26.7876008Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7876165Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7876325Z Allocatable: TRUE 2026-02-21T08:06:26.7876490Z Alloc Granule: 4KB 2026-02-21T08:06:26.7876661Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7876831Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7877001Z Accessible by all: FALSE 2026-02-21T08:06:26.7877141Z Pool 4 2026-02-21T08:06:26.7877270Z Segment: GROUP 2026-02-21T08:06:26.7877426Z Size: 64(0x40) KB 2026-02-21T08:06:26.7877582Z Allocatable: FALSE 2026-02-21T08:06:26.7877751Z Alloc Granule: 0KB 2026-02-21T08:06:26.7877920Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7878129Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7878298Z Accessible by all: FALSE 2026-02-21T08:06:26.7878437Z ISA Info: 2026-02-21T08:06:26.7878542Z ISA 1 2026-02-21T08:06:26.7878682Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7878863Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7879032Z Profiles: HSA_PROFILE_BASE 
2026-02-21T08:06:26.7879231Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7879404Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7879565Z Fast f16: TRUE 2026-02-21T08:06:26.7879731Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7879878Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7880018Z x 1024(0x400) 2026-02-21T08:06:26.7880162Z y 1024(0x400) 2026-02-21T08:06:26.7880310Z z 1024(0x400) 2026-02-21T08:06:26.7880465Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7880608Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7880737Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7880883Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7881030Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7881186Z FBarrier Max Size: 32 2026-02-21T08:06:26.7881330Z ISA 2 2026-02-21T08:06:26.7881479Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7881671Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7881843Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7882012Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7882186Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7882349Z Fast f16: TRUE 2026-02-21T08:06:26.7882513Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7882708Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7882845Z x 1024(0x400) 2026-02-21T08:06:26.7882987Z y 1024(0x400) 2026-02-21T08:06:26.7883175Z z 1024(0x400) 2026-02-21T08:06:26.7883331Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7883475Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7883604Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7883749Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7883896Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7884053Z FBarrier Max Size: 32 2026-02-21T08:06:26.7884188Z ******* 2026-02-21T08:06:26.7884290Z Agent 6 2026-02-21T08:06:26.7884388Z ******* 2026-02-21T08:06:26.7884511Z Name: gfx942 2026-02-21T08:06:26.7884664Z Uuid: GPU-a8cd01be4ce60285 2026-02-21T08:06:26.7884836Z Marketing Name: 
AMD Instinct MI325X 2026-02-21T08:06:26.7885002Z Vendor Name: AMD 2026-02-21T08:06:26.7885164Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7885377Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7885538Z Float Round Mode: NEAR 2026-02-21T08:06:26.7885702Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7885860Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7886019Z Queue Max Size: 131072(0x20000) 2026-02-21T08:06:26.7886177Z Queue Type: MULTI 2026-02-21T08:06:26.7886331Z Node: 5 2026-02-21T08:06:26.7886485Z Device Type: GPU 2026-02-21T08:06:26.7886619Z Cache Info: 2026-02-21T08:06:26.7886747Z L1: 32(0x20) KB 2026-02-21T08:06:26.7886891Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7887038Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7887186Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7887345Z ASIC Revision: 1(0x1) 2026-02-21T08:06:26.7887508Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7887668Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7887822Z BDFID: 5376 2026-02-21T08:06:26.7887977Z Internal Node ID: 5 2026-02-21T08:06:26.7888139Z Compute Unit: 304 2026-02-21T08:06:26.7888295Z SIMDs per CU: 4 2026-02-21T08:06:26.7888457Z Shader Engines: 32 2026-02-21T08:06:26.7888621Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7888791Z WatchPts on Addr. 
Ranges:4 2026-02-21T08:06:26.7888960Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7889103Z Memory Properties: 2026-02-21T08:06:26.7889222Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7889367Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7889532Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7889694Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7889972Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7890130Z x 1024(0x400) 2026-02-21T08:06:26.7890310Z y 1024(0x400) 2026-02-21T08:06:26.7890508Z z 1024(0x400) 2026-02-21T08:06:26.7890712Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7890906Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7891095Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7891284Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7891433Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7891602Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7891774Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7891973Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7892168Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7892381Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7892598Z IOMMU Support:: None 2026-02-21T08:06:26.7892754Z Pool Info: 2026-02-21T08:06:26.7892896Z Pool 1 2026-02-21T08:06:26.7893092Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7893290Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7893496Z Allocatable: TRUE 2026-02-21T08:06:26.7893684Z Alloc Granule: 4KB 2026-02-21T08:06:26.7893892Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7894086Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7894300Z Accessible by all: FALSE 2026-02-21T08:06:26.7894461Z Pool 2 2026-02-21T08:06:26.7894636Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7894842Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7895019Z Allocatable: TRUE 2026-02-21T08:06:26.7895224Z Alloc Granule: 4KB 2026-02-21T08:06:26.7895419Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7895620Z Alloc Alignment: 4KB 
2026-02-21T08:06:26.7895803Z Accessible by all: FALSE 2026-02-21T08:06:26.7895988Z Pool 3 2026-02-21T08:06:26.7896152Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7896327Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7896527Z Allocatable: TRUE 2026-02-21T08:06:26.7896708Z Alloc Granule: 4KB 2026-02-21T08:06:26.7896914Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7897131Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7897359Z Accessible by all: FALSE 2026-02-21T08:06:26.7897530Z Pool 4 2026-02-21T08:06:26.7897687Z Segment: GROUP 2026-02-21T08:06:26.7897933Z Size: 64(0x40) KB 2026-02-21T08:06:26.7915479Z Allocatable: FALSE 2026-02-21T08:06:26.7915661Z Alloc Granule: 0KB 2026-02-21T08:06:26.7915842Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7916019Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7916190Z Accessible by all: FALSE 2026-02-21T08:06:26.7916338Z ISA Info: 2026-02-21T08:06:26.7916528Z ISA 1 2026-02-21T08:06:26.7916724Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7916915Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7917096Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7917270Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7917442Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7917609Z Fast f16: TRUE 2026-02-21T08:06:26.7917770Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7917919Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7918058Z x 1024(0x400) 2026-02-21T08:06:26.7918201Z y 1024(0x400) 2026-02-21T08:06:26.7918345Z z 1024(0x400) 2026-02-21T08:06:26.7918498Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7918680Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7918805Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7918949Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7919091Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7919243Z FBarrier Max Size: 32 2026-02-21T08:06:26.7919380Z ISA 2 2026-02-21T08:06:26.7919528Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7919715Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7919883Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7920055Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7920262Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7920455Z Fast f16: TRUE 2026-02-21T08:06:26.7920625Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7920772Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7920902Z x 1024(0x400) 2026-02-21T08:06:26.7921044Z y 1024(0x400) 2026-02-21T08:06:26.7921204Z z 1024(0x400) 2026-02-21T08:06:26.7921359Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7921506Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7921650Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7921795Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7921938Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7922096Z FBarrier Max Size: 32 2026-02-21T08:06:26.7922264Z ******* 2026-02-21T08:06:26.7922359Z Agent 7 2026-02-21T08:06:26.7922453Z ******* 2026-02-21T08:06:26.7922631Z Name: gfx942 2026-02-21T08:06:26.7922786Z Uuid: GPU-9c66436f78af1ebe 2026-02-21T08:06:26.7922948Z Marketing Name: AMD Instinct MI325X 2026-02-21T08:06:26.7923111Z Vendor Name: AMD 2026-02-21T08:06:26.7923271Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7923427Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7923649Z Float Round Mode: NEAR 2026-02-21T08:06:26.7923818Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7923978Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7924156Z Queue Max Size: 131072(0x20000) 2026-02-21T08:06:26.7924313Z Queue Type: MULTI 2026-02-21T08:06:26.7924467Z Node: 6 2026-02-21T08:06:26.7924618Z Device Type: GPU 2026-02-21T08:06:26.7924753Z Cache Info: 2026-02-21T08:06:26.7924877Z L1: 32(0x20) KB 2026-02-21T08:06:26.7925028Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7925178Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7925326Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7925490Z ASIC Revision: 1(0x1) 
2026-02-21T08:06:26.7925656Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7925860Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7926013Z BDFID: 62720 2026-02-21T08:06:26.7926172Z Internal Node ID: 6 2026-02-21T08:06:26.7926334Z Compute Unit: 304 2026-02-21T08:06:26.7926490Z SIMDs per CU: 4 2026-02-21T08:06:26.7926650Z Shader Engines: 32 2026-02-21T08:06:26.7926812Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7926980Z WatchPts on Addr. Ranges:4 2026-02-21T08:06:26.7927149Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7927301Z Memory Properties: 2026-02-21T08:06:26.7927421Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7927567Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7927736Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7927899Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7928054Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7928191Z x 1024(0x400) 2026-02-21T08:06:26.7928340Z y 1024(0x400) 2026-02-21T08:06:26.7928487Z z 1024(0x400) 2026-02-21T08:06:26.7928643Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7928816Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7928983Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7929135Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7929267Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7929423Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7929570Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7929735Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7929923Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7930092Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7930264Z IOMMU Support:: None 2026-02-21T08:06:26.7930403Z Pool Info: 2026-02-21T08:06:26.7930517Z Pool 1 2026-02-21T08:06:26.7930659Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7930834Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7931048Z Allocatable: TRUE 2026-02-21T08:06:26.7931228Z Alloc Granule: 4KB 2026-02-21T08:06:26.7931422Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7931608Z Alloc 
Alignment: 4KB 2026-02-21T08:06:26.7931787Z Accessible by all: FALSE 2026-02-21T08:06:26.7931962Z Pool 2 2026-02-21T08:06:26.7932155Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7932330Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7932496Z Allocatable: TRUE 2026-02-21T08:06:26.7932684Z Alloc Granule: 4KB 2026-02-21T08:06:26.7932862Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7933096Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7933267Z Accessible by all: FALSE 2026-02-21T08:06:26.7933447Z Pool 3 2026-02-21T08:06:26.7933638Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7933813Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7933973Z Allocatable: TRUE 2026-02-21T08:06:26.7934233Z Alloc Granule: 4KB 2026-02-21T08:06:26.7934445Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7934614Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7934846Z Accessible by all: FALSE 2026-02-21T08:06:26.7935006Z Pool 4 2026-02-21T08:06:26.7935170Z Segment: GROUP 2026-02-21T08:06:26.7935336Z Size: 64(0x40) KB 2026-02-21T08:06:26.7935508Z Allocatable: FALSE 2026-02-21T08:06:26.7935676Z Alloc Granule: 0KB 2026-02-21T08:06:26.7935844Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7936037Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7936204Z Accessible by all: FALSE 2026-02-21T08:06:26.7936350Z ISA Info: 2026-02-21T08:06:26.7936459Z ISA 1 2026-02-21T08:06:26.7936642Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7936821Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7936997Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7937169Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7937340Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7937509Z Fast f16: TRUE 2026-02-21T08:06:26.7937673Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7937821Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7937959Z x 1024(0x400) 2026-02-21T08:06:26.7938102Z y 1024(0x400) 2026-02-21T08:06:26.7938247Z z 1024(0x400) 
2026-02-21T08:06:26.7938401Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7938547Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7938675Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7938857Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7939003Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7939159Z FBarrier Max Size: 32 2026-02-21T08:06:26.7939300Z ISA 2 2026-02-21T08:06:26.7939449Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7939639Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7939811Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7939980Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7940152Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7940313Z Fast f16: TRUE 2026-02-21T08:06:26.7940477Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7940626Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7940761Z x 1024(0x400) 2026-02-21T08:06:26.7940940Z y 1024(0x400) 2026-02-21T08:06:26.7941080Z z 1024(0x400) 2026-02-21T08:06:26.7941234Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7941374Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7941504Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7941647Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7941792Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7941949Z FBarrier Max Size: 32 2026-02-21T08:06:26.7942083Z ******* 2026-02-21T08:06:26.7942185Z Agent 8 2026-02-21T08:06:26.7942283Z ******* 2026-02-21T08:06:26.7942403Z Name: gfx942 2026-02-21T08:06:26.7942559Z Uuid: GPU-489fccc039800b1a 2026-02-21T08:06:26.7942722Z Marketing Name: AMD Instinct MI325X 2026-02-21T08:06:26.7942886Z Vendor Name: AMD 2026-02-21T08:06:26.7943047Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7943209Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7943369Z Float Round Mode: NEAR 2026-02-21T08:06:26.7943532Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7943691Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7943850Z Queue Max Size: 131072(0x20000) 
2026-02-21T08:06:26.7944008Z Queue Type: MULTI 2026-02-21T08:06:26.7944196Z Node: 7 2026-02-21T08:06:26.7944349Z Device Type: GPU 2026-02-21T08:06:26.7944486Z Cache Info: 2026-02-21T08:06:26.7944636Z L1: 32(0x20) KB 2026-02-21T08:06:26.7944785Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7944928Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7945078Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7945255Z ASIC Revision: 1(0x1) 2026-02-21T08:06:26.7945466Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7945673Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7945835Z BDFID: 34048 2026-02-21T08:06:26.7946026Z Internal Node ID: 7 2026-02-21T08:06:26.7946189Z Compute Unit: 304 2026-02-21T08:06:26.7946348Z SIMDs per CU: 4 2026-02-21T08:06:26.7946511Z Shader Engines: 32 2026-02-21T08:06:26.7946671Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7946867Z WatchPts on Addr. Ranges:4 2026-02-21T08:06:26.7947076Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7947218Z Memory Properties: 2026-02-21T08:06:26.7947338Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7947492Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7947677Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7947844Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7948040Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7948174Z x 1024(0x400) 2026-02-21T08:06:26.7948349Z y 1024(0x400) 2026-02-21T08:06:26.7948503Z z 1024(0x400) 2026-02-21T08:06:26.7948653Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7948816Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7948977Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7949118Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7949240Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7949382Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7949526Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7949683Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7949860Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7950031Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7950199Z IOMMU 
Support:: None 2026-02-21T08:06:26.7950336Z Pool Info: 2026-02-21T08:06:26.7950438Z Pool 1 2026-02-21T08:06:26.7950576Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7950742Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7950907Z Allocatable: TRUE 2026-02-21T08:06:26.7951072Z Alloc Granule: 4KB 2026-02-21T08:06:26.7951247Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7951420Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7951586Z Accessible by all: FALSE 2026-02-21T08:06:26.7951732Z Pool 2 2026-02-21T08:06:26.7951873Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7952040Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7952199Z Allocatable: TRUE 2026-02-21T08:06:26.7952366Z Alloc Granule: 4KB 2026-02-21T08:06:26.7952540Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7952711Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7952879Z Accessible by all: FALSE 2026-02-21T08:06:26.7953017Z Pool 3 2026-02-21T08:06:26.7953149Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7953349Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7953512Z Allocatable: TRUE 2026-02-21T08:06:26.7953676Z Alloc Granule: 4KB 2026-02-21T08:06:26.7953848Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7954024Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7954187Z Accessible by all: FALSE 2026-02-21T08:06:26.7954332Z Pool 4 2026-02-21T08:06:26.7954462Z Segment: GROUP 2026-02-21T08:06:26.7954621Z Size: 64(0x40) KB 2026-02-21T08:06:26.7954778Z Allocatable: FALSE 2026-02-21T08:06:26.7954945Z Alloc Granule: 0KB 2026-02-21T08:06:26.7955122Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7955291Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7955487Z Accessible by all: FALSE 2026-02-21T08:06:26.7955630Z ISA Info: 2026-02-21T08:06:26.7955739Z ISA 1 2026-02-21T08:06:26.7955893Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7956108Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7956286Z Profiles: HSA_PROFILE_BASE 
2026-02-21T08:06:26.7956457Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7956642Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7956805Z Fast f16: TRUE 2026-02-21T08:06:26.7956998Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7957156Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7957295Z x 1024(0x400) 2026-02-21T08:06:26.7957486Z y 1024(0x400) 2026-02-21T08:06:26.7957637Z z 1024(0x400) 2026-02-21T08:06:26.7957819Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7958028Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7958161Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7958314Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7958484Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7958690Z FBarrier Max Size: 32 2026-02-21T08:06:26.7958831Z ISA 2 2026-02-21T08:06:26.7958984Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7959170Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7959399Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7959583Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7959795Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7959967Z Fast f16: TRUE 2026-02-21T08:06:26.7960135Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7960286Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7960417Z x 1024(0x400) 2026-02-21T08:06:26.7960560Z y 1024(0x400) 2026-02-21T08:06:26.7960701Z z 1024(0x400) 2026-02-21T08:06:26.7960894Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7961034Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7961164Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7961310Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7961453Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7961607Z FBarrier Max Size: 32 2026-02-21T08:06:26.7961742Z ******* 2026-02-21T08:06:26.7961843Z Agent 9 2026-02-21T08:06:26.7961939Z ******* 2026-02-21T08:06:26.7962059Z Name: gfx942 2026-02-21T08:06:26.7962215Z Uuid: GPU-fac84a106f8362ee 2026-02-21T08:06:26.7962378Z Marketing Name: 
AMD Instinct MI325X 2026-02-21T08:06:26.7962546Z Vendor Name: AMD 2026-02-21T08:06:26.7962753Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7962948Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7963111Z Float Round Mode: NEAR 2026-02-21T08:06:26.7963273Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7963432Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7963591Z Queue Max Size: 131072(0x20000) 2026-02-21T08:06:26.7963748Z Queue Type: MULTI 2026-02-21T08:06:26.7963897Z Node: 8 2026-02-21T08:06:26.7964050Z Device Type: GPU 2026-02-21T08:06:26.7964184Z Cache Info: 2026-02-21T08:06:26.7964308Z L1: 32(0x20) KB 2026-02-21T08:06:26.7964456Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7964598Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7964748Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7964902Z ASIC Revision: 1(0x1) 2026-02-21T08:06:26.7965066Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7965225Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7965383Z BDFID: 58624 2026-02-21T08:06:26.7965538Z Internal Node ID: 8 2026-02-21T08:06:26.7965699Z Compute Unit: 304 2026-02-21T08:06:26.7965856Z SIMDs per CU: 4 2026-02-21T08:06:26.7966011Z Shader Engines: 32 2026-02-21T08:06:26.7966178Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7966346Z WatchPts on Addr. 
Ranges:4 2026-02-21T08:06:26.7966519Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7966661Z Memory Properties: 2026-02-21T08:06:26.7966783Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7966936Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7967097Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7967262Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7967407Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7967588Z x 1024(0x400) 2026-02-21T08:06:26.7967727Z y 1024(0x400) 2026-02-21T08:06:26.7967887Z z 1024(0x400) 2026-02-21T08:06:26.7968090Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7968261Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7968429Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7968595Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7968743Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7968933Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7969074Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7969230Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7969405Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7969590Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7969753Z IOMMU Support:: None 2026-02-21T08:06:26.7969891Z Pool Info: 2026-02-21T08:06:26.7969996Z Pool 1 2026-02-21T08:06:26.7970131Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7970331Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7970526Z Allocatable: TRUE 2026-02-21T08:06:26.7970693Z Alloc Granule: 4KB 2026-02-21T08:06:26.7970874Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7971081Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7971248Z Accessible by all: FALSE 2026-02-21T08:06:26.7971391Z Pool 2 2026-02-21T08:06:26.7971526Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7971691Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7971850Z Allocatable: TRUE 2026-02-21T08:06:26.7972013Z Alloc Granule: 4KB 2026-02-21T08:06:26.7972187Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7972358Z Alloc Alignment: 4KB 
2026-02-21T08:06:26.7972525Z Accessible by all: FALSE 2026-02-21T08:06:26.7972666Z Pool 3 2026-02-21T08:06:26.7972801Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7972957Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7973114Z Allocatable: TRUE 2026-02-21T08:06:26.7973279Z Alloc Granule: 4KB 2026-02-21T08:06:26.7973447Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7973624Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7973788Z Accessible by all: FALSE 2026-02-21T08:06:26.7973933Z Pool 4 2026-02-21T08:06:26.7974060Z Segment: GROUP 2026-02-21T08:06:26.7974215Z Size: 64(0x40) KB 2026-02-21T08:06:26.7974373Z Allocatable: FALSE 2026-02-21T08:06:26.7974534Z Alloc Granule: 0KB 2026-02-21T08:06:26.7974707Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7974877Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7975043Z Accessible by all: FALSE 2026-02-21T08:06:26.7975183Z ISA Info: 2026-02-21T08:06:26.7975287Z ISA 1 2026-02-21T08:06:26.7975460Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7975637Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7975813Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7975980Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7976152Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7976314Z Fast f16: TRUE 2026-02-21T08:06:26.7976477Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7976628Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7976763Z x 1024(0x400) 2026-02-21T08:06:26.7976908Z y 1024(0x400) 2026-02-21T08:06:26.7977049Z z 1024(0x400) 2026-02-21T08:06:26.7977205Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7977347Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7977504Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7977651Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7977794Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7977952Z FBarrier Max Size: 32 2026-02-21T08:06:26.7978136Z ISA 2 2026-02-21T08:06:26.7978298Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7978516Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7978725Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7978893Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7979095Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7979265Z Fast f16: TRUE 2026-02-21T08:06:26.7979429Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7979581Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7979712Z x 1024(0x400) 2026-02-21T08:06:26.7979963Z y 1024(0x400) 2026-02-21T08:06:26.7980103Z z 1024(0x400) 2026-02-21T08:06:26.7980281Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7980423Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7980551Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7980697Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7980851Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7981008Z FBarrier Max Size: 32 2026-02-21T08:06:26.7981147Z ******* 2026-02-21T08:06:26.7981247Z Agent 10 2026-02-21T08:06:26.7981342Z ******* 2026-02-21T08:06:26.7981463Z Name: gfx942 2026-02-21T08:06:26.7981621Z Uuid: GPU-56bb44b4843f18b0 2026-02-21T08:06:26.7981783Z Marketing Name: AMD Instinct MI325X 2026-02-21T08:06:26.7981950Z Vendor Name: AMD 2026-02-21T08:06:26.7982109Z Feature: KERNEL_DISPATCH 2026-02-21T08:06:26.7982270Z Profile: BASE_PROFILE 2026-02-21T08:06:26.7982429Z Float Round Mode: NEAR 2026-02-21T08:06:26.7982625Z Max Queue Number: 128(0x80) 2026-02-21T08:06:26.7982786Z Queue Min Size: 64(0x40) 2026-02-21T08:06:26.7982942Z Queue Max Size: 131072(0x20000) 2026-02-21T08:06:26.7983102Z Queue Type: MULTI 2026-02-21T08:06:26.7983252Z Node: 9 2026-02-21T08:06:26.7983404Z Device Type: GPU 2026-02-21T08:06:26.7983540Z Cache Info: 2026-02-21T08:06:26.7983662Z L1: 32(0x20) KB 2026-02-21T08:06:26.7983806Z L2: 4096(0x1000) KB 2026-02-21T08:06:26.7983955Z L3: 262144(0x40000) KB 2026-02-21T08:06:26.7984102Z Chip ID: 29861(0x74a5) 2026-02-21T08:06:26.7984260Z ASIC Revision: 1(0x1) 
2026-02-21T08:06:26.7984425Z Cacheline Size: 128(0x80) 2026-02-21T08:06:26.7984584Z Max Clock Freq. (MHz): 2100 2026-02-21T08:06:26.7984682Z BDFID: 38144 2026-02-21T08:06:26.7984748Z Internal Node ID: 9 2026-02-21T08:06:26.7984813Z Compute Unit: 304 2026-02-21T08:06:26.7984884Z SIMDs per CU: 4 2026-02-21T08:06:26.7984951Z Shader Engines: 32 2026-02-21T08:06:26.7985021Z Shader Arrs. per Eng.: 1 2026-02-21T08:06:26.7985091Z WatchPts on Addr. Ranges:4 2026-02-21T08:06:26.7985164Z Coherent Host Access: FALSE 2026-02-21T08:06:26.7985210Z Memory Properties: 2026-02-21T08:06:26.7985264Z Features: KERNEL_DISPATCH 2026-02-21T08:06:26.7985334Z Fast F16 Operation: TRUE 2026-02-21T08:06:26.7985399Z Wavefront Size: 64(0x40) 2026-02-21T08:06:26.7985468Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7985519Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7985575Z x 1024(0x400) 2026-02-21T08:06:26.7985630Z y 1024(0x400) 2026-02-21T08:06:26.7985684Z z 1024(0x400) 2026-02-21T08:06:26.7985753Z Max Waves Per CU: 32(0x20) 2026-02-21T08:06:26.7985821Z Max Work-item Per CU: 2048(0x800) 2026-02-21T08:06:26.7985885Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7985934Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7985991Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7986047Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7986106Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7986175Z Max fbarriers/Workgrp: 32 2026-02-21T08:06:26.7986249Z Packet Processor uCode:: 185 2026-02-21T08:06:26.7986318Z SDMA engine uCode:: 24 2026-02-21T08:06:26.7986389Z IOMMU Support:: None 2026-02-21T08:06:26.7986432Z Pool Info: 2026-02-21T08:06:26.7986475Z Pool 1 2026-02-21T08:06:26.7986550Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T08:06:26.7986613Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7986682Z Allocatable: TRUE 2026-02-21T08:06:26.7986787Z Alloc Granule: 4KB 2026-02-21T08:06:26.7986865Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7986933Z Alloc 
Alignment: 4KB 2026-02-21T08:06:26.7987002Z Accessible by all: FALSE 2026-02-21T08:06:26.7987047Z Pool 2 2026-02-21T08:06:26.7987118Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T08:06:26.7987180Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7987249Z Allocatable: TRUE 2026-02-21T08:06:26.7987314Z Alloc Granule: 4KB 2026-02-21T08:06:26.7987388Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7987457Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7987527Z Accessible by all: FALSE 2026-02-21T08:06:26.7987568Z Pool 3 2026-02-21T08:06:26.7987674Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T08:06:26.7987733Z Size: 268419072(0xfffc000) KB 2026-02-21T08:06:26.7987798Z Allocatable: TRUE 2026-02-21T08:06:26.7987864Z Alloc Granule: 4KB 2026-02-21T08:06:26.7987940Z Alloc Recommended Granule:2048KB 2026-02-21T08:06:26.7988007Z Alloc Alignment: 4KB 2026-02-21T08:06:26.7988077Z Accessible by all: FALSE 2026-02-21T08:06:26.7988119Z Pool 4 2026-02-21T08:06:26.7988183Z Segment: GROUP 2026-02-21T08:06:26.7988242Z Size: 64(0x40) KB 2026-02-21T08:06:26.7988311Z Allocatable: FALSE 2026-02-21T08:06:26.7988380Z Alloc Granule: 0KB 2026-02-21T08:06:26.7988452Z Alloc Recommended Granule:0KB 2026-02-21T08:06:26.7988519Z Alloc Alignment: 0KB 2026-02-21T08:06:26.7988591Z Accessible by all: FALSE 2026-02-21T08:06:26.7988633Z ISA Info: 2026-02-21T08:06:26.7988672Z ISA 1 2026-02-21T08:06:26.7988750Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T08:06:26.7988821Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7988888Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7988962Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7989034Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7989094Z Fast f16: TRUE 2026-02-21T08:06:26.7989164Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7989218Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7989275Z x 1024(0x400) 2026-02-21T08:06:26.7989332Z y 1024(0x400) 2026-02-21T08:06:26.7989449Z z 1024(0x400) 
2026-02-21T08:06:26.7989519Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7989570Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7989632Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7989690Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7989781Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7989854Z FBarrier Max Size: 32 2026-02-21T08:06:26.7989900Z ISA 2 2026-02-21T08:06:26.7989984Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T08:06:26.7990056Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T08:06:26.7990127Z Profiles: HSA_PROFILE_BASE 2026-02-21T08:06:26.7990198Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7990269Z Default Rounding Mode: NEAR 2026-02-21T08:06:26.7990333Z Fast f16: TRUE 2026-02-21T08:06:26.7990402Z Workgroup Max Size: 1024(0x400) 2026-02-21T08:06:26.7990452Z Workgroup Max Size per Dimension: 2026-02-21T08:06:26.7990514Z x 1024(0x400) 2026-02-21T08:06:26.7990569Z y 1024(0x400) 2026-02-21T08:06:26.7990653Z z 1024(0x400) 2026-02-21T08:06:26.7990720Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T08:06:26.7990770Z Grid Max Size per Dimension: 2026-02-21T08:06:26.7990827Z x 4294967295(0xffffffff) 2026-02-21T08:06:26.7990883Z y 4294967295(0xffffffff) 2026-02-21T08:06:26.7990942Z z 4294967295(0xffffffff) 2026-02-21T08:06:26.7991010Z FBarrier Max Size: 32 2026-02-21T08:06:26.7991051Z *** Done *** 2026-02-21T08:06:26.8230260Z ##[group]Run set -x 2026-02-21T08:06:26.8230384Z set -x 2026-02-21T08:06:26.8230430Z apt-get update 2026-02-21T08:06:26.8230478Z apt-get install -y git 2026-02-21T08:06:26.8230673Z shell: bash -l {0} 2026-02-21T08:06:26.8230713Z env: 2026-02-21T08:06:26.8230787Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:26.8230829Z ##[endgroup] 2026-02-21T08:06:26.9187670Z + apt-get update 2026-02-21T08:06:27.0285165Z Get:1 https://repo.radeon.com/amdgpu/6.4.4/ubuntu noble InRelease [5465 B] 2026-02-21T08:06:27.0349441Z Get:2 https://repo.radeon.com/rocm/apt/6.4.4 noble InRelease [2605 B] 
2026-02-21T08:06:27.1615600Z Get:3 https://repo.radeon.com/amdgpu/6.4.4/ubuntu noble/main amd64 Packages [14.6 kB] 2026-02-21T08:06:27.2100417Z Get:4 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB] 2026-02-21T08:06:27.2550172Z Get:5 https://repo.radeon.com/rocm/apt/6.4.4 noble/main amd64 Packages [60.5 kB] 2026-02-21T08:06:27.2867807Z Get:6 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB] 2026-02-21T08:06:27.3590950Z Get:7 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB] 2026-02-21T08:06:27.4693224Z Get:8 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB] 2026-02-21T08:06:27.4695877Z Get:9 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB] 2026-02-21T08:06:27.5181901Z Get:10 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB] 2026-02-21T08:06:27.8715248Z Get:11 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB] 2026-02-21T08:06:28.0108343Z Get:12 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB] 2026-02-21T08:06:28.1507810Z Get:13 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB] 2026-02-21T08:06:28.1839001Z Get:14 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB] 2026-02-21T08:06:28.2796584Z Get:15 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB] 2026-02-21T08:06:28.8452508Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB] 2026-02-21T08:06:28.8666471Z Get:17 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB] 2026-02-21T08:06:28.8878293Z Get:18 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB] 2026-02-21T08:06:28.9154355Z Get:19 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB] 2026-02-21T08:06:28.9622347Z Get:20 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 
kB] 2026-02-21T08:06:28.9628201Z Get:21 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB] 2026-02-21T08:06:28.9628814Z Get:22 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB] 2026-02-21T08:06:29.4088776Z Fetched 36.3 MB in 2s (14.7 MB/s) 2026-02-21T08:06:29.8538881Z Reading package lists... 2026-02-21T08:06:29.8652777Z W: https://repo.radeon.com/amdgpu/6.4.4/ubuntu/dists/noble/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T08:06:29.8653796Z W: https://repo.radeon.com/rocm/apt/6.4.4/dists/noble/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T08:06:29.8659736Z + apt-get install -y git 2026-02-21T08:06:30.3020570Z Reading package lists... 2026-02-21T08:06:30.4440502Z Building dependency tree... 2026-02-21T08:06:30.4441351Z Reading state information... 2026-02-21T08:06:30.5419366Z The following additional packages will be installed: 2026-02-21T08:06:30.5420925Z git-man less libcbor0.10 libcurl3t64-gnutls liberror-perl libfido2-1 2026-02-21T08:06:30.5422473Z libxmuu1 openssh-client xauth 2026-02-21T08:06:30.5425755Z Suggested packages: 2026-02-21T08:06:30.5426176Z gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui 2026-02-21T08:06:30.5427310Z gitk gitweb git-cvs git-mediawiki git-svn keychain libpam-ssh monkeysphere 2026-02-21T08:06:30.5427558Z ssh-askpass 2026-02-21T08:06:30.5641183Z The following NEW packages will be installed: 2026-02-21T08:06:30.5645533Z git git-man less libcbor0.10 libcurl3t64-gnutls liberror-perl libfido2-1 2026-02-21T08:06:30.5647021Z libxmuu1 openssh-client xauth 2026-02-21T08:06:30.7894174Z 0 upgraded, 10 newly installed, 0 to remove and 101 not upgraded. 2026-02-21T08:06:30.7894696Z Need to get 6330 kB of archives. 
2026-02-21T08:06:30.7895109Z After this operation, 29.8 MB of additional disk space will be used. 2026-02-21T08:06:30.7895819Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB] 2026-02-21T08:06:31.2801027Z Get:2 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB] 2026-02-21T08:06:31.2924893Z Get:3 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB] 2026-02-21T08:06:31.3397239Z Get:4 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B] 2026-02-21T08:06:31.3441066Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB] 2026-02-21T08:06:31.6257104Z Get:6 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB] 2026-02-21T08:06:31.6287906Z Get:7 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB] 2026-02-21T08:06:31.6660475Z Get:8 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 [25.6 kB] 2026-02-21T08:06:31.6677790Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB] 2026-02-21T08:06:31.7542840Z Get:10 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB] 2026-02-21T08:06:31.9598989Z debconf: delaying package configuration, since apt-utils is not installed 2026-02-21T08:06:31.9777922Z Fetched 6330 kB in 1s (4814 kB/s) 2026-02-21T08:06:31.9895156Z Selecting previously unselected package less. 2026-02-21T08:06:31.9924868Z (Reading database ... 2026-02-21T08:06:31.9931246Z (Reading database ... 5% 2026-02-21T08:06:31.9942298Z (Reading database ... 10% 2026-02-21T08:06:31.9949373Z (Reading database ... 15% 2026-02-21T08:06:31.9955502Z (Reading database ... 20% 2026-02-21T08:06:31.9961323Z (Reading database ... 
25% 2026-02-21T08:06:31.9970874Z (Reading database ... 30% 2026-02-21T08:06:31.9972994Z (Reading database ... 35% 2026-02-21T08:06:31.9973149Z (Reading database ... 40% 2026-02-21T08:06:31.9973351Z (Reading database ... 45% 2026-02-21T08:06:31.9973506Z (Reading database ... 50% 2026-02-21T08:06:31.9973664Z (Reading database ... 55% 2026-02-21T08:06:31.9973818Z (Reading database ... 60% 2026-02-21T08:06:31.9973974Z (Reading database ... 65% 2026-02-21T08:06:31.9974129Z (Reading database ... 70% 2026-02-21T08:06:31.9974282Z (Reading database ... 75% 2026-02-21T08:06:31.9974446Z (Reading database ... 80% 2026-02-21T08:06:31.9974604Z (Reading database ... 85% 2026-02-21T08:06:31.9974762Z (Reading database ... 90% 2026-02-21T08:06:31.9974922Z (Reading database ... 95% 2026-02-21T08:06:31.9975084Z (Reading database ... 100% 2026-02-21T08:06:31.9975347Z (Reading database ... 28634 files and directories currently installed.) 2026-02-21T08:06:31.9979532Z Preparing to unpack .../0-less_590-2ubuntu2.1_amd64.deb ... 2026-02-21T08:06:31.9991478Z Unpacking less (590-2ubuntu2.1) ... 2026-02-21T08:06:32.0084888Z Selecting previously unselected package libcbor0.10:amd64. 2026-02-21T08:06:32.0100044Z Preparing to unpack .../1-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ... 2026-02-21T08:06:32.0104100Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:32.0188921Z Selecting previously unselected package libfido2-1:amd64. 2026-02-21T08:06:32.0207078Z Preparing to unpack .../2-libfido2-1_1.14.0-1build3_amd64.deb ... 2026-02-21T08:06:32.0209103Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:32.0306556Z Selecting previously unselected package libxmuu1:amd64. 2026-02-21T08:06:32.0321809Z Preparing to unpack .../3-libxmuu1_2%3a1.1.3-3build2_amd64.deb ... 2026-02-21T08:06:32.0323981Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:32.0394543Z Selecting previously unselected package openssh-client. 
2026-02-21T08:06:32.0410812Z Preparing to unpack .../4-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ... 2026-02-21T08:06:32.0441039Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:32.0607361Z Selecting previously unselected package xauth. 2026-02-21T08:06:32.0622317Z Preparing to unpack .../5-xauth_1%3a1.1.2-1build1_amd64.deb ... 2026-02-21T08:06:32.0623895Z Unpacking xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:32.0691034Z Selecting previously unselected package libcurl3t64-gnutls:amd64. 2026-02-21T08:06:32.0705520Z Preparing to unpack .../6-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ... 2026-02-21T08:06:32.0707285Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:32.0793522Z Selecting previously unselected package liberror-perl. 2026-02-21T08:06:32.0808586Z Preparing to unpack .../7-liberror-perl_0.17029-2_all.deb ... 2026-02-21T08:06:32.0810230Z Unpacking liberror-perl (0.17029-2) ... 2026-02-21T08:06:32.0886856Z Selecting previously unselected package git-man. 2026-02-21T08:06:32.0902765Z Preparing to unpack .../8-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ... 2026-02-21T08:06:32.0904274Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:32.1007913Z Selecting previously unselected package git. 2026-02-21T08:06:32.1022664Z Preparing to unpack .../9-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ... 2026-02-21T08:06:32.1047645Z Unpacking git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:32.1680502Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:32.1685402Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:32.1690000Z Setting up less (590-2ubuntu2.1) ... 2026-02-21T08:06:32.1717215Z Setting up liberror-perl (0.17029-2) ... 2026-02-21T08:06:32.1721648Z Setting up git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:32.1726328Z Setting up libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:32.1730798Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ... 
2026-02-21T08:06:32.1735981Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:32.1989138Z Setting up git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:32.2020178Z Setting up xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:32.2025878Z Processing triggers for libc-bin (2.39-0ubuntu8.6) ... 2026-02-21T08:06:32.2476291Z ##[group]Run actions/checkout@v6 2026-02-21T08:06:32.2476426Z with: 2026-02-21T08:06:32.2476519Z repository: pytorch/helion 2026-02-21T08:06:32.2476696Z token: *** 2026-02-21T08:06:32.2476785Z ssh-strict: true 2026-02-21T08:06:32.2476872Z ssh-user: git 2026-02-21T08:06:32.2476970Z persist-credentials: true 2026-02-21T08:06:32.2477074Z clean: true 2026-02-21T08:06:32.2477170Z sparse-checkout-cone-mode: true 2026-02-21T08:06:32.2477283Z fetch-depth: 1 2026-02-21T08:06:32.2477370Z fetch-tags: false 2026-02-21T08:06:32.2477458Z show-progress: true 2026-02-21T08:06:32.2477548Z lfs: false 2026-02-21T08:06:32.2477628Z submodules: false 2026-02-21T08:06:32.2477721Z set-safe-directory: true 2026-02-21T08:06:32.2477986Z env: 2026-02-21T08:06:32.2478068Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:32.2478176Z ##[endgroup] 2026-02-21T08:06:32.2502266Z ##[command]/usr/bin/docker exec 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:32.4892468Z Syncing repository: pytorch/helion 2026-02-21T08:06:32.4893426Z ##[group]Getting Git version info 2026-02-21T08:06:32.4893628Z Working directory is '/__w/helion/helion' 2026-02-21T08:06:32.4893901Z [command]/usr/bin/git version 2026-02-21T08:06:32.4894019Z git version 2.43.0 2026-02-21T08:06:32.4895009Z ##[endgroup] 2026-02-21T08:06:32.4904512Z Temporarily overriding HOME='/__w/_temp/bd5d00ae-b4c0-4db9-886a-94e3a2619ff1' before making global git config changes 2026-02-21T08:06:32.4904847Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T08:06:32.4907161Z [command]/usr/bin/git config --global 
--add safe.directory /__w/helion/helion 2026-02-21T08:06:32.4929907Z Deleting the contents of '/__w/helion/helion' 2026-02-21T08:06:32.4931579Z ##[group]Initializing the repository 2026-02-21T08:06:32.4932751Z [command]/usr/bin/git init /__w/helion/helion 2026-02-21T08:06:32.4953495Z hint: Using 'master' as the name for the initial branch. This default branch name 2026-02-21T08:06:32.4953778Z hint: is subject to change. To configure the initial branch name to use in all 2026-02-21T08:06:32.4953997Z hint: of your new repositories, which will suppress this warning, call: 2026-02-21T08:06:32.4954154Z hint: 2026-02-21T08:06:32.4954307Z hint: git config --global init.defaultBranch 2026-02-21T08:06:32.4954450Z hint: 2026-02-21T08:06:32.4954584Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2026-02-21T08:06:32.4954806Z hint: 'development'. The just-created branch can be renamed via this command: 2026-02-21T08:06:32.4954992Z hint: 2026-02-21T08:06:32.4955083Z hint: git branch -m 2026-02-21T08:06:32.4955739Z Initialized empty Git repository in /__w/helion/helion/.git/ 2026-02-21T08:06:32.4961049Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion 2026-02-21T08:06:32.4979794Z ##[endgroup] 2026-02-21T08:06:32.4979997Z ##[group]Disabling automatic garbage collection 2026-02-21T08:06:32.4981285Z [command]/usr/bin/git config --local gc.auto 0 2026-02-21T08:06:32.4997946Z ##[endgroup] 2026-02-21T08:06:32.4998136Z ##[group]Setting up auth 2026-02-21T08:06:32.4998921Z Removing SSH command configuration 2026-02-21T08:06:32.5001471Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T08:06:32.5017141Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T08:06:32.5150674Z Removing HTTP extra header 2026-02-21T08:06:32.5151542Z [command]/usr/bin/git config --local 
--name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T08:06:32.5164128Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T08:06:32.5279888Z Removing includeIf entries pointing to credentials config files 2026-02-21T08:06:32.5282016Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T08:06:32.5296110Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T08:06:32.5422323Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config http.https://github.com/.extraheader AUTHORIZATION: basic *** 2026-02-21T08:06:32.5440483Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T08:06:32.5458919Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T08:06:32.5474714Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T08:06:32.5489063Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T08:06:32.5500881Z ##[endgroup] 2026-02-21T08:06:32.5501066Z ##[group]Fetching the repository 2026-02-21T08:06:32.5504618Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main 2026-02-21T08:06:33.0736891Z From 
https://github.com/pytorch/helion 2026-02-21T08:06:33.0737206Z * [new ref] 874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main 2026-02-21T08:06:33.0753384Z [command]/usr/bin/git branch --list --remote origin/main 2026-02-21T08:06:33.0773358Z origin/main 2026-02-21T08:06:33.0779133Z [command]/usr/bin/git rev-parse refs/remotes/origin/main 2026-02-21T08:06:33.0794774Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:33.0797217Z ##[endgroup] 2026-02-21T08:06:33.0797381Z ##[group]Determining the checkout info 2026-02-21T08:06:33.0800326Z ##[endgroup] 2026-02-21T08:06:33.0802059Z [command]/usr/bin/git sparse-checkout disable 2026-02-21T08:06:33.0821783Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2026-02-21T08:06:33.0836642Z ##[group]Checking out the ref 2026-02-21T08:06:33.0838701Z [command]/usr/bin/git checkout --progress --force -B main refs/remotes/origin/main 2026-02-21T08:06:33.1011840Z Switched to a new branch 'main' 2026-02-21T08:06:33.1012306Z branch 'main' set up to track 'origin/main'. 
2026-02-21T08:06:33.1014055Z ##[endgroup] 2026-02-21T08:06:33.1040433Z [command]/usr/bin/git log -1 --format=%H 2026-02-21T08:06:33.1052039Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:33.1193214Z ##[group]Run actions/setup-python@v6 2026-02-21T08:06:33.1193358Z with: 2026-02-21T08:06:33.1193447Z python-version: 3.12 2026-02-21T08:06:33.1193544Z check-latest: false 2026-02-21T08:06:33.1193685Z token: *** 2026-02-21T08:06:33.1193776Z update-environment: true 2026-02-21T08:06:33.1193883Z allow-prereleases: false 2026-02-21T08:06:33.1193995Z freethreaded: false 2026-02-21T08:06:33.1194086Z env: 2026-02-21T08:06:33.1194172Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:33.1194283Z ##[endgroup] 2026-02-21T08:06:33.1196577Z ##[command]/usr/bin/docker exec 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:33.3174070Z ##[group]Installed versions 2026-02-21T08:06:33.3179507Z Version 3.12 was not found in the local cache 2026-02-21T08:06:33.8491681Z Version 3.12 is available for downloading 2026-02-21T08:06:33.8492316Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz" 2026-02-21T08:06:34.8870495Z Extract downloaded archive 2026-02-21T08:06:34.8965100Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/3b2cb34e-d9a4-4ea8-ac0a-6dc84bdd17d8 -f /__w/_temp/6fe56298-1f17-4909-b978-5a49907e4cf3 2026-02-21T08:06:36.1661639Z Execute installation script 2026-02-21T08:06:36.1749346Z Check if Python hostedtoolcache folder exist... 2026-02-21T08:06:36.1750744Z Creating Python hostedtoolcache folder... 
2026-02-21T08:06:36.1757295Z Create Python 3.12.12 folder 2026-02-21T08:06:36.1760418Z Copy Python binaries to hostedtoolcache folder 2026-02-21T08:06:36.6031754Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action) 2026-02-21T08:06:36.6047920Z Upgrading pip... 2026-02-21T08:06:37.5936524Z Looking in links: /tmp/tmpzhzlgfrp 2026-02-21T08:06:37.5937235Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1) 2026-02-21T08:06:37.5974333Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2026-02-21T08:06:38.0154447Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag. 
2026-02-21T08:06:38.1388143Z Collecting pip 2026-02-21T08:06:38.1764781Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB) 2026-02-21T08:06:38.1818550Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB) 2026-02-21T08:06:38.2049445Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 162.6 MB/s eta 0:00:00 2026-02-21T08:06:38.2049651Z 2026-02-21T08:06:38.2122413Z Installing collected packages: pip 2026-02-21T08:06:38.2123620Z Attempting uninstall: pip 2026-02-21T08:06:38.2132261Z Found existing installation: pip 25.0.1 2026-02-21T08:06:38.2280191Z Uninstalling pip-25.0.1: 2026-02-21T08:06:38.2308249Z Successfully uninstalled pip-25.0.1 2026-02-21T08:06:38.6990413Z Successfully installed pip-26.0.1 2026-02-21T08:06:38.7271499Z Create complete file 2026-02-21T08:06:38.7491632Z Successfully set up CPython (3.12.12) 2026-02-21T08:06:38.7492015Z ##[endgroup] 2026-02-21T08:06:38.8098284Z ##[group]Run astral-sh/setup-uv@v7 2026-02-21T08:06:38.8098469Z with: 2026-02-21T08:06:38.8098597Z activate-environment: false 2026-02-21T08:06:38.8098789Z working-directory: /home/runner/_work/helion/helion 2026-02-21T08:06:38.8113532Z github-token: *** 2026-02-21T08:06:38.8113661Z enable-cache: auto 2026-02-21T08:06:38.8114020Z cache-dependency-glob: **/*requirements*.txt **/*requirements*.in **/*constraints*.txt **/*constraints*.in **/pyproject.toml **/uv.lock **/*.py.lock 2026-02-21T08:06:38.8114456Z restore-cache: true 2026-02-21T08:06:38.8114590Z save-cache: true 2026-02-21T08:06:38.8114712Z prune-cache: true 2026-02-21T08:06:38.8114842Z cache-python: false 2026-02-21T08:06:38.8114983Z ignore-nothing-to-cache: false 2026-02-21T08:06:38.8115148Z ignore-empty-workdir: false 2026-02-21T08:06:38.8115314Z add-problem-matchers: true 2026-02-21T08:06:38.8115468Z resolution-strategy: highest 2026-02-21T08:06:38.8115618Z env: 2026-02-21T08:06:38.8115764Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:38.8115941Z pythonLocation: /__w/_tool/Python/3.12.12/x64 
2026-02-21T08:06:38.8116169Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:38.8116390Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:38.8116587Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:38.8116789Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:38.8116988Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T08:06:38.8117168Z ##[endgroup] 2026-02-21T08:06:38.8121922Z ##[command]/usr/bin/docker exec 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:39.0298113Z (node:701) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T08:06:39.0298475Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T08:06:39.0335857Z Trying to find version for uv in: /__w/helion/helion/uv.toml 2026-02-21T08:06:39.0336039Z Could not find file: /__w/helion/helion/uv.toml 2026-02-21T08:06:39.0336213Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml 2026-02-21T08:06:39.0373196Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest. 2026-02-21T08:06:39.0373486Z Getting latest version from GitHub API... 2026-02-21T08:06:39.2686829Z manifest-file not provided, reading from local file. 2026-02-21T08:06:39.2709579Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases. 2026-02-21T08:06:39.2710749Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ... 
2026-02-21T08:06:40.2341718Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/416381ee-2d9f-4ce0-98e4-51a9429e21dc -f /__w/_temp/2cdffd75-6a4f-4df0-a6ed-6b2378cec22a 2026-02-21T08:06:40.4986843Z Added /github/home/.local/bin to the path 2026-02-21T08:06:40.4989351Z Added /__w/_tool/uv/0.10.4/x86_64 to the path 2026-02-21T08:06:40.4989847Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python 2026-02-21T08:06:40.4990348Z Added /github/home/.local/share/uv/python to the path 2026-02-21T08:06:40.5001804Z Successfully installed uv version 0.10.4 2026-02-21T08:06:40.6342867Z ##[group]Run uv venv --python 3.12 2026-02-21T08:06:40.6343053Z uv venv --python 3.12 2026-02-21T08:06:40.6343335Z shell: bash -l {0} 2026-02-21T08:06:40.6343425Z env: 2026-02-21T08:06:40.6343512Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:40.6343655Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.6343818Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:40.6343982Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.6344127Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.6344264Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.6344408Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T08:06:40.6344565Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:40.6344716Z ##[endgroup] 2026-02-21T08:06:40.7164082Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12 2026-02-21T08:06:40.7164314Z Creating virtual environment at: .venv 2026-02-21T08:06:40.7171937Z Activate with: source .venv/bin/activate 2026-02-21T08:06:40.7230159Z ##[group]Run source .venv/bin/activate 2026-02-21T08:06:40.7230342Z source .venv/bin/activate 2026-02-21T08:06:40.7230706Z uv pip install -U "torch==2.9.*" --index-url https://download.pytorch.org/whl/rocm6.4 2026-02-21T08:06:40.7231055Z shell: bash -l {0} 
2026-02-21T08:06:40.7231164Z env: 2026-02-21T08:06:40.7231267Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:40.7231431Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.7231629Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:40.7231826Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.7231997Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.7232163Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:40.7232340Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T08:06:40.7232533Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:40.7232715Z ##[endgroup] 2026-02-21T08:06:42.7054904Z Resolved 11 packages in 1.91s 2026-02-21T08:06:42.7824116Z Downloading networkx (2.0MiB) 2026-02-21T08:06:42.7824484Z Downloading sympy (6.0MiB) 2026-02-21T08:06:42.8882806Z Downloading torch (4.2GiB) 2026-02-21T08:06:42.8883057Z Downloading pytorch-triton-rocm (261.8MiB) 2026-02-21T08:06:45.3117466Z Downloaded networkx 2026-02-21T08:06:45.7762499Z Downloaded sympy 2026-02-21T08:06:49.3716728Z Downloaded pytorch-triton-rocm 2026-02-21T08:07:50.0456391Z Downloaded torch 2026-02-21T08:07:50.0456620Z Prepared 11 packages in 1m 07s 2026-02-21T08:07:50.0654753Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:50.0655076Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:50.0655411Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:55.7777469Z Installed 11 packages in 5.73s 2026-02-21T08:07:55.7781566Z + filelock==3.20.0 2026-02-21T08:07:55.7783143Z + fsspec==2025.12.0 2026-02-21T08:07:55.7783498Z + jinja2==3.1.6 2026-02-21T08:07:55.7783711Z + markupsafe==3.0.2 2026-02-21T08:07:55.7783915Z + mpmath==1.3.0 2026-02-21T08:07:55.7784108Z + networkx==3.6.1 2026-02-21T08:07:55.7784364Z + pytorch-triton-rocm==3.5.1 2026-02-21T08:07:55.7784628Z + setuptools==70.2.0 2026-02-21T08:07:55.7784838Z + sympy==1.14.0 2026-02-21T08:07:55.7785031Z + torch==2.9.1+rocm6.4 2026-02-21T08:07:55.7785257Z + typing-extensions==4.15.0 2026-02-21T08:07:55.7978618Z ##[group]Run source .venv/bin/activate 2026-02-21T08:07:55.7978815Z source .venv/bin/activate 2026-02-21T08:07:55.7978992Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]' 2026-02-21T08:07:55.7979208Z python -c "import helion; print(helion.__name__)" 2026-02-21T08:07:55.7979734Z shell: bash -l {0} 2026-02-21T08:07:55.7979831Z env: 2026-02-21T08:07:55.7979927Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:07:55.7980058Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:55.7980222Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:07:55.7980380Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:55.7980521Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:55.7980668Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:55.7980809Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T08:07:55.7980984Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:07:55.7981129Z ##[endgroup] 2026-02-21T08:07:56.6291367Z Resolved 30 packages in 547ms 2026-02-21T08:07:56.6300828Z Building helion @ file:///__w/helion/helion 2026-02-21T08:07:56.6370043Z Downloading virtualenv (5.6MiB) 2026-02-21T08:07:56.6425396Z Downloading scipy (33.4MiB) 2026-02-21T08:07:56.6426372Z Downloading pygments (1.2MiB) 2026-02-21T08:07:56.6458179Z Downloading 
numpy (15.8MiB) 2026-02-21T08:07:56.6459811Z Downloading scikit-learn (8.5MiB) 2026-02-21T08:07:56.7307943Z Built helion @ file:///__w/helion/helion 2026-02-21T08:07:56.7452331Z Downloaded virtualenv 2026-02-21T08:07:56.7537267Z Downloaded pygments 2026-02-21T08:07:56.9269436Z Downloaded scikit-learn 2026-02-21T08:07:56.9314119Z Downloaded numpy 2026-02-21T08:07:57.2801948Z Downloaded scipy 2026-02-21T08:07:57.2803371Z Prepared 27 packages in 651ms 2026-02-21T08:07:57.2813521Z Uninstalled 1 package in 0.92ms 2026-02-21T08:07:57.2820008Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:57.2820330Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:57.2820641Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:07:57.3648757Z Installed 29 packages in 83ms 2026-02-21T08:07:57.3649476Z + cfgv==3.5.0 2026-02-21T08:07:57.3649884Z + distlib==0.4.0 2026-02-21T08:07:57.3651113Z + expecttest==0.3.0 2026-02-21T08:07:57.3651308Z + filecheck==1.0.3 2026-02-21T08:07:57.3651422Z - filelock==3.20.0 2026-02-21T08:07:57.3651517Z + filelock==3.24.3 2026-02-21T08:07:57.3651641Z + helion==0.0.0 (from file:///__w/helion/helion) 2026-02-21T08:07:57.3651800Z + hypothesis==6.151.9 2026-02-21T08:07:57.3651924Z + identify==2.6.16 2026-02-21T08:07:57.3652024Z + iniconfig==2.3.0 2026-02-21T08:07:57.3652114Z + joblib==1.5.3 2026-02-21T08:07:57.3652217Z + markdown-it-py==4.0.0 2026-02-21T08:07:57.3652378Z + mdurl==0.1.2 2026-02-21T08:07:57.3652472Z + nodeenv==1.10.0 2026-02-21T08:07:57.3652558Z + numpy==2.4.2 2026-02-21T08:07:57.3652650Z + packaging==26.0 2026-02-21T08:07:57.3652746Z + platformdirs==4.9.2 2026-02-21T08:07:57.3652876Z + pluggy==1.6.0 2026-02-21T08:07:57.3652967Z + pre-commit==4.5.1 2026-02-21T08:07:57.3653065Z + psutil==7.2.2 2026-02-21T08:07:57.3653748Z + pygments==2.19.2 
2026-02-21T08:07:57.3653854Z + pytest==9.0.2 2026-02-21T08:07:57.3653944Z + pytest-timeout==2.4.0 2026-02-21T08:07:57.3654046Z + pyyaml==6.0.3 2026-02-21T08:07:57.3654132Z + rich==14.3.3 2026-02-21T08:07:57.3654225Z + scikit-learn==1.8.0 2026-02-21T08:07:57.3654323Z + scipy==1.17.0 2026-02-21T08:07:57.3654418Z + sortedcontainers==2.4.0 2026-02-21T08:07:57.3654528Z + threadpoolctl==3.6.0 2026-02-21T08:07:57.3654624Z + virtualenv==20.38.0 2026-02-21T08:08:12.5966938Z helion 2026-02-21T08:08:14.0216728Z ##[group]Run set -x 2026-02-21T08:08:14.0216885Z set -x 2026-02-21T08:08:14.0216984Z source .venv/bin/activate 2026-02-21T08:08:14.0217111Z uv pip install pip 2026-02-21T08:08:14.0217240Z uv pip install quack-kernels --no-deps 2026-02-21T08:08:14.0217402Z mkdir -p benchmarks/ && pushd benchmarks/ 2026-02-21T08:08:14.0217588Z git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T08:08:14.0217766Z pushd tritonbench/ 2026-02-21T08:08:14.0217895Z git submodule update --init --recursive 2026-02-21T08:08:14.0218039Z uv pip install -r requirements.txt 2026-02-21T08:08:14.0218175Z python install.py --liger 2026-02-21T08:08:14.0218295Z uv pip install -e . 
--no-deps 2026-02-21T08:08:14.0218415Z popd 2026-02-21T08:08:14.0218501Z popd 2026-02-21T08:08:14.0218731Z shell: bash -l {0} 2026-02-21T08:08:14.0218818Z env: 2026-02-21T08:08:14.0218931Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:08:14.0219060Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:14.0219227Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:08:14.0219386Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:14.0219527Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:14.0219669Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:14.0219807Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T08:08:14.0219971Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:08:14.0220114Z ##[endgroup] 2026-02-21T08:08:14.0578917Z + source .venv/bin/activate 2026-02-21T08:08:14.0579264Z ++ '[' -z '' ']' 2026-02-21T08:08:14.0579393Z ++ '[' -n x ']' 2026-02-21T08:08:14.0579535Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T08:08:14.0579755Z ++ '[' .venv/bin/activate = /__w/_temp/8b34363f-b31b-4bac-8d07-2e9aab508267.sh ']' 2026-02-21T08:08:14.0580062Z ++ deactivate nondestructive 2026-02-21T08:08:14.0580207Z ++ unset -f pydoc 2026-02-21T08:08:14.0580321Z ++ '[' -z '' ']' 2026-02-21T08:08:14.0580430Z ++ '[' -z '' ']' 2026-02-21T08:08:14.0580538Z ++ hash -r 2026-02-21T08:08:14.0580640Z ++ '[' -z '' ']' 2026-02-21T08:08:14.0580752Z ++ unset VIRTUAL_ENV 2026-02-21T08:08:14.0580881Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T08:08:14.0581035Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T08:08:14.0581202Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T08:08:14.0581379Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T08:08:14.0581515Z ++ '[' linux-gnu = msys ']' 2026-02-21T08:08:14.0581646Z ++ export VIRTUAL_ENV 2026-02-21T08:08:14.0581764Z ++ '[' -z '' ']' 2026-02-21T08:08:14.0581878Z ++ unset SCRIPT_PATH 2026-02-21T08:08:14.0582357Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:08:14.0583216Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:08:14.0583694Z ++ export PATH 2026-02-21T08:08:14.0583812Z ++ '[' xhelion '!=' x ']' 2026-02-21T08:08:14.0583949Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T08:08:14.0584091Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T08:08:14.0584670Z ++ '[' -z '' ']' 2026-02-21T08:08:14.0584773Z ++ '[' -z '' ']' 2026-02-21T08:08:14.0584883Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T08:08:14.0585004Z ++ PS1='(helion) ' 2026-02-21T08:08:14.0585124Z ++ export PS1 2026-02-21T08:08:14.0585228Z ++ alias pydoc 2026-02-21T08:08:14.0585338Z ++ true 2026-02-21T08:08:14.0585436Z ++ hash -r 2026-02-21T08:08:14.0585549Z + uv pip install pip 2026-02-21T08:08:14.1044019Z Resolved 1 package in 42ms 2026-02-21T08:08:14.1084769Z Downloading pip (1.7MiB) 2026-02-21T08:08:14.1437398Z Downloaded pip 2026-02-21T08:08:14.1437547Z Prepared 1 package in 39ms 2026-02-21T08:08:14.1605257Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:14.1605584Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T08:08:14.1605897Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:08:14.1746640Z Installed 1 package in 30ms 2026-02-21T08:08:14.1747162Z + pip==26.0.1 2026-02-21T08:08:14.1869287Z + uv pip install quack-kernels --no-deps 2026-02-21T08:08:14.2113847Z Resolved 1 package in 18ms 2026-02-21T08:08:14.2379151Z Prepared 1 package in 26ms 2026-02-21T08:08:14.2532502Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:14.2532980Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:08:14.2533417Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:08:14.2545921Z Installed 1 package in 16ms 2026-02-21T08:08:14.2546106Z + quack-kernels==0.2.10 2026-02-21T08:08:14.2666968Z + mkdir -p benchmarks/ 2026-02-21T08:08:14.2676690Z + pushd benchmarks/ 2026-02-21T08:08:14.2678883Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:08:14.2679245Z + git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T08:08:14.2688753Z Cloning into 'tritonbench'... 
2026-02-21T08:08:14.9256902Z + pushd tritonbench/ 2026-02-21T08:08:14.9257274Z + git submodule update --init --recursive 2026-02-21T08:08:14.9257771Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:08:14.9380445Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens' 2026-02-21T08:08:14.9381594Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter' 2026-02-21T08:08:14.9382494Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass' 2026-02-21T08:08:14.9383545Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention' 2026-02-21T08:08:14.9384860Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders' 2026-02-21T08:08:14.9386141Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers' 2026-02-21T08:08:14.9398423Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'... 2026-02-21T08:08:16.2046436Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'... 2026-02-21T08:08:22.2022356Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'... 2026-02-21T08:08:24.8406348Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'... 2026-02-21T08:08:25.4940945Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'... 2026-02-21T08:08:25.9129179Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'... 
2026-02-21T08:08:26.8389395Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b' 2026-02-21T08:08:27.0674596Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190' 2026-02-21T08:08:27.0690797Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel' 2026-02-21T08:08:27.0706819Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'... 2026-02-21T08:08:29.7583619Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40' 2026-02-21T08:08:30.1055331Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e' 2026-02-21T08:08:30.1605805Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6' 2026-02-21T08:08:30.1616825Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel' 2026-02-21T08:08:30.1617444Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass' 2026-02-21T08:08:30.1631971Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'... 2026-02-21T08:08:33.0968610Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'... 
2026-02-21T08:08:35.7971284Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb' 2026-02-21T08:08:36.1649429Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52' 2026-02-21T08:08:36.1864044Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5' 2026-02-21T08:08:36.1874867Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass' 2026-02-21T08:08:36.1887930Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'... 2026-02-21T08:08:39.0701999Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b' 2026-02-21T08:08:39.1204716Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44' 2026-02-21T08:08:39.1213592Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled' 2026-02-21T08:08:39.1214472Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass' 2026-02-21T08:08:39.1215295Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention' 2026-02-21T08:08:39.1231537Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'... 2026-02-21T08:08:41.6554037Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'... 
2026-02-21T08:08:44.2050770Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'... 2026-02-21T08:08:45.1218340Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d' 2026-02-21T08:08:45.4552652Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0' 2026-02-21T08:08:45.4983453Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2026-02-21T08:08:45.5000532Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel' 2026-02-21T08:08:45.5002075Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass' 2026-02-21T08:08:45.5024656Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'... 2026-02-21T08:08:48.2494124Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'... 
2026-02-21T08:08:51.0770868Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2026-02-21T08:08:51.4054233Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2026-02-21T08:08:51.4087577Z + uv pip install -r requirements.txt 2026-02-21T08:08:51.4146378Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:51.6581470Z Resolved 30 packages in 242ms 2026-02-21T08:08:51.6697141Z Downloading pillow (6.7MiB) 2026-02-21T08:08:51.6697298Z Downloading fonttools (4.7MiB) 2026-02-21T08:08:51.6697427Z Downloading hf-xet (3.2MiB) 2026-02-21T08:08:51.6697774Z Downloading transformers (10.3MiB) 2026-02-21T08:08:51.6698478Z Downloading tokenizers (3.0MiB) 2026-02-21T08:08:51.6728518Z Downloading matplotlib (8.3MiB) 2026-02-21T08:08:51.6729608Z Downloading kiwisolver (1.4MiB) 2026-02-21T08:08:51.7278960Z Downloaded kiwisolver 2026-02-21T08:08:51.7751171Z Downloaded tokenizers 2026-02-21T08:08:51.7808552Z Downloaded hf-xet 2026-02-21T08:08:51.8445598Z Downloaded pillow 2026-02-21T08:08:51.8450742Z Downloaded fonttools 2026-02-21T08:08:51.8688435Z Downloaded matplotlib 2026-02-21T08:08:51.9832275Z Downloaded transformers 2026-02-21T08:08:51.9832643Z Prepared 23 packages in 324ms 2026-02-21T08:08:52.0023705Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:52.0024562Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:08:52.0025406Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:08:52.2505049Z Installed 23 packages in 263ms 2026-02-21T08:08:52.2505303Z + certifi==2026.1.4 2026-02-21T08:08:52.2505501Z + charset-normalizer==3.4.4 2026-02-21T08:08:52.2505654Z + contourpy==1.3.3 2026-02-21T08:08:52.2505784Z + cycler==0.12.1 2026-02-21T08:08:52.2505907Z + fonttools==4.61.1 2026-02-21T08:08:52.2506037Z + hf-xet==1.2.0 2026-02-21T08:08:52.2506172Z + huggingface-hub==0.36.2 2026-02-21T08:08:52.2506315Z + idna==3.11 2026-02-21T08:08:52.2506432Z + kiwisolver==1.4.9 2026-02-21T08:08:52.2506566Z + matplotlib==3.10.8 2026-02-21T08:08:52.2506703Z + nvidia-ml-py==13.590.48 2026-02-21T08:08:52.2506864Z + pillow==12.1.1 2026-02-21T08:08:52.2506991Z + pyparsing==3.3.2 2026-02-21T08:08:52.2507127Z + python-dateutil==2.9.0.post0 2026-02-21T08:08:52.2507278Z + regex==2026.2.19 2026-02-21T08:08:52.2507397Z + requests==2.32.5 2026-02-21T08:08:52.2507524Z + safetensors==0.7.0 2026-02-21T08:08:52.2507648Z + six==1.17.0 2026-02-21T08:08:52.2507768Z + tabulate==0.9.0 2026-02-21T08:08:52.2507891Z + tokenizers==0.21.4 2026-02-21T08:08:52.2508018Z + tqdm==4.67.3 2026-02-21T08:08:52.2508138Z + transformers==4.53.0 2026-02-21T08:08:52.2508276Z + urllib3==2.6.3 2026-02-21T08:08:52.2786816Z + python install.py --liger 2026-02-21T08:08:56.7276050Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:56.7299382Z Audited 6 packages in 2ms 2026-02-21T08:08:56.7680197Z INFO:__main__:[tritonbench] installing liger-kernels... 2026-02-21T08:08:56.7720881Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:56.8515798Z Resolved 1 package in 78ms 2026-02-21T08:08:56.9000623Z Prepared 1 package in 48ms 2026-02-21T08:08:56.9176776Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:56.9177434Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T08:08:56.9178059Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:08:56.9214070Z Installed 1 package in 21ms 2026-02-21T08:08:56.9214632Z + liger-kernel-nightly==0.7.0.dev20260219183429 2026-02-21T08:08:56.9296970Z INFO:__main__:[tritonbench] installation complete! 2026-02-21T08:08:57.7295511Z + uv pip install -e . --no-deps 2026-02-21T08:08:57.7612542Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:57.7635190Z Resolved 1 package in 1ms 2026-02-21T08:08:57.7639596Z Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:08:58.4768113Z Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:08:58.4965981Z Prepared 1 package in 733ms 2026-02-21T08:08:58.4970596Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:58.4971072Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:08:58.4971546Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:08:58.4974906Z Installed 1 package in 0.80ms 2026-02-21T08:08:58.4975154Z + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench) 2026-02-21T08:08:58.5324764Z + popd 2026-02-21T08:08:58.5325002Z + popd 2026-02-21T08:08:58.5325383Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:08:58.5325617Z /__w/helion/helion 2026-02-21T08:08:58.5410480Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:08:58.5410838Z rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:08:58.5411068Z  2026-02-21T08:08:58.5411207Z source .venv/bin/activate 2026-02-21T08:08:58.5411382Z  2026-02-21T08:08:58.5411539Z TEST_REPORTS_DIR=$(pwd)/test/test-reports 2026-02-21T08:08:58.5411766Z mkdir -p "$TEST_REPORTS_DIR" 2026-02-21T08:08:58.5411955Z echo "$TEST_REPORTS_DIR" 2026-02-21T08:08:58.5412165Z  2026-02-21T08:08:58.5412317Z KERNEL_LIST="int4_gemm,flash_attention" 2026-02-21T08:08:58.5412554Z for kernel in ${KERNEL_LIST//,/ }; do 2026-02-21T08:08:58.5412776Z  echo "==========================================" 2026-02-21T08:08:58.5413026Z  echo "Running benchmark for kernel: $kernel" 2026-02-21T08:08:58.5413265Z  echo "==========================================" 2026-02-21T08:08:58.5413487Z  2026-02-21T08:08:58.5430935Z  # Get available implementations and baseline for this kernel 2026-02-21T08:08:58.5431418Z  KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:") 2026-02-21T08:08:58.5431863Z  IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p') 2026-02-21T08:08:58.5432202Z  BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p') 2026-02-21T08:08:58.5432465Z  2026-02-21T08:08:58.5432612Z  if [[ -z "$IMPLS" ]]; then 2026-02-21T08:08:58.5432892Z  echo "Warning: No implementations found for kernel $kernel, skipping..." 
2026-02-21T08:08:58.5433180Z  continue 2026-02-21T08:08:58.5433341Z  fi 2026-02-21T08:08:58.5433505Z  if [[ -z "$BASELINE" ]]; then 2026-02-21T08:08:58.5433785Z  echo "Warning: No baseline found for kernel $kernel, skipping..." 2026-02-21T08:08:58.5434062Z  continue 2026-02-21T08:08:58.5434210Z  fi 2026-02-21T08:08:58.5434358Z  echo "Using baseline: $BASELINE" 2026-02-21T08:08:58.5434829Z  echo "Available implementations for $kernel: $IMPLS" 2026-02-21T08:08:58.5435052Z  2026-02-21T08:08:58.5435215Z  # Do autotuning but do not record the results 2026-02-21T08:08:58.5435434Z  python benchmarks/run.py \ 2026-02-21T08:08:58.5435625Z  --op $kernel \ 2026-02-21T08:08:58.5435810Z  --metrics speedup,accuracy \ 2026-02-21T08:08:58.5436033Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:08:58.5436251Z  --cudagraph \ 2026-02-21T08:08:58.5436415Z  --only $IMPLS \ 2026-02-21T08:08:58.5436619Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:08:58.5436835Z  --baseline $BASELINE \ 2026-02-21T08:08:58.5437018Z  --atol 1e-2 \ 2026-02-21T08:08:58.5437177Z  --rtol 1e-2 \ 2026-02-21T08:08:58.5437372Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:08:58.5437589Z  --keep-going \ 2026-02-21T08:08:58.5437743Z   2026-02-21T08:08:58.5437875Z  2026-02-21T08:08:58.5437999Z  # Relax the GPU 2026-02-21T08:08:58.5438160Z  sleep 2m 2026-02-21T08:08:58.5438294Z  2026-02-21T08:08:58.5438448Z  # Run again with cache and record results 2026-02-21T08:08:58.5438763Z  HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \ 2026-02-21T08:08:58.5439064Z  --op $kernel \ 2026-02-21T08:08:58.5439243Z  --metrics speedup,accuracy \ 2026-02-21T08:08:58.5439462Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:08:58.5439674Z  --cudagraph \ 2026-02-21T08:08:58.5439833Z  --only $IMPLS \ 2026-02-21T08:08:58.5440177Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:08:58.5440391Z  --baseline $BASELINE \ 2026-02-21T08:08:58.5440579Z  --atol 1e-2 \ 
2026-02-21T08:08:58.5440740Z  --rtol 1e-2 \ 2026-02-21T08:08:58.5440926Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:08:58.5441178Z  --output "$TEST_REPORTS_DIR/helionbench.json" \ 2026-02-21T08:08:58.5441408Z  --append-to-output \ 2026-02-21T08:08:58.5441590Z  --keep-going \ 2026-02-21T08:08:58.5441743Z   2026-02-21T08:08:58.5441873Z  2026-02-21T08:08:58.5442047Z  echo "✅ Completed benchmark for kernel: $kernel" 2026-02-21T08:08:58.5442266Z done 2026-02-21T08:08:58.5442398Z  2026-02-21T08:08:58.5442728Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then 2026-02-21T08:08:58.5442998Z  echo "❌ helionbench.json is missing or empty" 2026-02-21T08:08:58.5443205Z  exit 1 2026-02-21T08:08:58.5443352Z fi 2026-02-21T08:08:58.5443506Z cat "$TEST_REPORTS_DIR/helionbench.json" 2026-02-21T08:08:58.5443900Z shell: bash -l {0} 2026-02-21T08:08:58.5444037Z env: 2026-02-21T08:08:58.5444170Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:08:58.5444373Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:58.5444617Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:08:58.5444868Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:58.5445083Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:58.5445305Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:58.5445532Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T08:08:58.5445779Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:08:58.5446009Z ##[endgroup] 2026-02-21T08:08:58.6572516Z /__w/helion/helion/test/test-reports 2026-02-21T08:08:58.6573541Z ========================================== 2026-02-21T08:08:58.6573927Z Running benchmark for kernel: int4_gemm 2026-02-21T08:08:58.6574265Z ========================================== 2026-02-21T08:09:10.1835146Z Using baseline: preprocessed_eager_int4_gemm 2026-02-21T08:09:10.1836055Z Available implementations for int4_gemm: 
helion_int4_gemm_tritonbench,preprocessed_torch_compile_int4_gemm,preprocessed_triton_int4_gemm 2026-02-21T08:09:20.5589727Z Applying custom args for int4_gemm: {'num_inputs': 10} 2026-02-21T08:09:20.5904775Z Running int4_gemm benchmark with Helion implementation... 2026-02-21T08:09:20.5905066Z 2026-02-21T08:09:20.8084776Z Equally-spaced-k mode: Selected 10 equally spaced inputs (total available: 32) 2026-02-21T08:09:20.8085244Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 3, 7, 10, 14, 17, 21, 24, 28, 31] 2026-02-21T08:09:20.8090253Z 2026-02-21T08:09:20.8098307Z 0%| | 0/10 [00:00::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:39:15.2139116Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T08:39:15.2142872Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:39:15.2143271Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T08:39:15.2143623Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:39:15.2143922Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:39:15.2144171Z #smem = #ttg.shared_memory 2026-02-21T08:39:15.2144412Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:39:15.2144981Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:39:15.2145407Z %cst = arith.constant dense<16> : tensor<16x1xi64, #mma> 2026-02-21T08:39:15.2145578Z %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma> 
2026-02-21T08:39:15.2145756Z %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma> 2026-02-21T08:39:15.2146537Z %cst_2 = arith.constant dense<7168> : tensor<1x256xi64, #mma> 2026-02-21T08:39:15.2146713Z %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T08:39:15.2146899Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:15.2147073Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:15.2147254Z %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T08:39:15.2147408Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:39:15.2147566Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T08:39:15.2147730Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T08:39:15.2147855Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:39:15.2147979Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:39:15.2148100Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:39:15.2148219Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:39:15.2148338Z %c28_i32 = arith.constant 28 : i32 2026-02-21T08:39:15.2148485Z %cst_8 = arith.constant dense<0> : tensor<1x256xi8, #blocked2> 2026-02-21T08:39:15.2148804Z %cst_9 = arith.constant dense<7168> : tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2148991Z %cst_10 = arith.constant dense<0> : tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2149171Z %cst_11 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2149322Z %c112_i32 = arith.constant 112 : i32 2026-02-21T08:39:15.2149447Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:39:15.2149563Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:39:15.2149686Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:39:15.2149804Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T08:39:15.2149993Z %cst_12 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2150191Z %0 = tt.get_program_id x : i32 
2026-02-21T08:39:15.2150395Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:39:15.2150673Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:39:15.2150945Z %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:15.2151202Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:39:15.2151406Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:39:15.2151660Z %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:39:15.2152001Z %7 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:39:15.2152366Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:39:15.2152785Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:39:15.2153194Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:15.2153450Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:15.2153655Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T08:39:15.2153852Z %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:15.2154048Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T08:39:15.2154261Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 
2026-02-21T08:39:15.2154716Z %16 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:39:15.2155024Z %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:15.2155329Z %18 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:15.2155589Z scf.for %arg3 = %0 to %c28_i32 step %c4864_i32 : i32 { 2026-02-21T08:39:15.2155745Z %19 = arith.divsi %arg3, %c112_i32 : i32 2026-02-21T08:39:15.2155868Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T08:39:15.2155993Z %21 = arith.subi %c1_i32, %20 : i32 2026-02-21T08:39:15.2156111Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T08:39:15.2156237Z %23 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T08:39:15.2156357Z %24 = arith.remsi %23, %22 : i32 2026-02-21T08:39:15.2156475Z %25 = arith.addi %20, %24 : i32 2026-02-21T08:39:15.2156590Z %26 = arith.divsi %23, %22 : i32 2026-02-21T08:39:15.2156712Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T08:39:15.2156931Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:39:15.2157157Z %29 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:39:15.2157334Z %30 = arith.muli %26, %c256_i32 : i32 2026-02-21T08:39:15.2157559Z %31 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:39:15.2157821Z %32 = arith.muli %31, %cst_6 : tensor<16x1xi32, #blocked1> 2026-02-21T08:39:15.2158021Z %33 = tt.broadcast %32 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:39:15.2158198Z %34 = arith.extsi %30 : i32 to i64 2026-02-21T08:39:15.2158374Z %35 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T08:39:15.2158647Z %36 = arith.addi %35, %7 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:39:15.2158935Z %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2159203Z %38 = arith.cmpi sge, %37, %cst_10 : tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2159383Z %39 = arith.cmpi slt, %37, %cst_9 : tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2159553Z %40 = arith.andi %38, %39 : tensor<1x256xi1, #blocked2> 2026-02-21T08:39:15.2159791Z %41 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %cst_7) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T08:39:15.2160015Z %64 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:39:15.2160189Z %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:15.2160412Z %66 = arith.addi %65, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:15.2160690Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:39:15.2160969Z %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:39:15.2161164Z %69 = arith.addi %33, %68 : tensor<16x2xi32, #blocked1> 2026-02-21T08:39:15.2161378Z %70 = tt.addptr %4, %69 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:39:15.2161588Z %71 = tt.load %70 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:39:15.2161860Z %72 = ttg.convert_layout %71 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:15.2162274Z %73 = arith.extf %72 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:15.2162721Z %74 = arith.extsi %arg4 : i32 to i64 
2026-02-21T08:39:15.2162853Z %75 = arith.muli %74, %c7168_i64 : i64 2026-02-21T08:39:15.2162996Z %76 = tt.splat %75 : i64 -> tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2163158Z %77 = arith.addi %76, %37 : tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2163355Z %78 = tt.addptr %5, %77 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2163573Z %79 = tt.load %78, %40, %cst_8 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:39:15.2163839Z %80 = ttg.convert_layout %79 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2164126Z %81 = arith.shli %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2164370Z %82 = arith.shrsi %81, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2164613Z %83 = arith.shrsi %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2164912Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:39:15.2165296Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:39:15.2165582Z %86 = tt.broadcast %84 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2165828Z %87 = arith.select %12, %86, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2166070Z %88 = tt.broadcast %85 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2166311Z %89 = arith.select %14, %88, %87 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2166545Z %90 = tt.reshape %89 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T08:39:15.2166769Z %91 = arith.sitofp %90 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 
2026-02-21T08:39:15.2167026Z %92 = ttg.local_alloc %91 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T08:39:15.2167353Z %93 = ttg.local_load %92 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:15.2167840Z %94 = tt.dot %73, %93, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T08:39:15.2168215Z %95 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:39:15.2168341Z %96 = arith.muli %95, %c2_i32 : i32 2026-02-21T08:39:15.2168518Z %97 = tt.splat %96 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:15.2168746Z %98 = arith.addi %97, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:15.2169021Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:39:15.2169313Z %100 = tt.broadcast %99 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:39:15.2169511Z %101 = arith.addi %33, %100 : tensor<16x2xi32, #blocked1> 2026-02-21T08:39:15.2169719Z %102 = tt.addptr %4, %101 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:39:15.2169928Z %103 = tt.load %102 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:39:15.2170202Z %104 = ttg.convert_layout %103 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:15.2170646Z %105 = arith.extf %104 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:15.2170930Z %106 = arith.extsi %95 : i32 to i64 2026-02-21T08:39:15.2171062Z %107 = arith.muli %106, %c7168_i64 : i64 
2026-02-21T08:39:15.2171210Z %108 = tt.splat %107 : i64 -> tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2171376Z %109 = arith.addi %108, %37 : tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2171580Z %110 = tt.addptr %5, %109 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi64, #blocked2> 2026-02-21T08:39:15.2171775Z %111 = arith.cmpi slt, %106, %c4096_i64 : i64 2026-02-21T08:39:15.2171929Z %112 = tt.splat %111 : i1 -> tensor<1x256xi1, #blocked2> 2026-02-21T08:39:15.2172088Z %113 = arith.andi %112, %40 : tensor<1x256xi1, #blocked2> 2026-02-21T08:39:15.2172264Z %114 = tt.load %110, %113, %cst_8 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:39:15.2172534Z %115 = ttg.convert_layout %114 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2172824Z %116 = arith.shli %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2173128Z %117 = arith.shrsi %116, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2173369Z %118 = arith.shrsi %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:15.2173668Z %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:39:15.2174016Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:39:15.2174310Z %121 = tt.broadcast %119 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2174568Z %122 = arith.select %12, %121, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2174821Z %123 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:39:15.2175067Z %124 = arith.select %14, %123, %122 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 
2026-02-21T08:39:15.2175305Z %125 = tt.reshape %124 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T08:39:15.2175537Z %126 = arith.sitofp %125 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T08:39:15.2175800Z %127 = ttg.local_alloc %126 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T08:39:15.2176132Z %128 = ttg.local_load %127 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:15.2176612Z %129 = tt.dot %105, %128, %94, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T08:39:15.2176974Z scf.yield %129 : tensor<16x256xf32, #mma> 2026-02-21T08:39:15.2177132Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:39:15.2177328Z %42 = arith.truncf %41 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T08:39:15.2177503Z %43 = arith.extsi %27 : i32 to i64 2026-02-21T08:39:15.2177670Z %44 = tt.splat %43 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:39:15.2177882Z %45 = arith.addi %44, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:39:15.2178143Z %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T08:39:15.2178386Z %47 = arith.muli %46, %cst_1 : tensor<16x1xi64, #mma> 2026-02-21T08:39:15.2178597Z %48 = tt.broadcast %47 : tensor<16x1xi64, #mma> -> tensor<16x256xi64, #mma> 2026-02-21T08:39:15.2178805Z %49 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:15.2179019Z %50 = arith.addi %49, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:15.2179286Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi64, 
#ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:39:15.2179550Z %52 = tt.broadcast %51 : tensor<1x256xi64, #mma> -> tensor<16x256xi64, #mma> 2026-02-21T08:39:15.2179729Z %53 = arith.addi %48, %52 : tensor<16x256xi64, #mma> 2026-02-21T08:39:15.2179920Z %54 = tt.addptr %15, %53 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi64, #mma> 2026-02-21T08:39:15.2180119Z %55 = arith.cmpi sge, %46, %cst_0 : tensor<16x1xi64, #mma> 2026-02-21T08:39:15.2180285Z %56 = arith.cmpi slt, %46, %cst : tensor<16x1xi64, #mma> 2026-02-21T08:39:15.2180441Z %57 = arith.andi %55, %56 : tensor<16x1xi1, #mma> 2026-02-21T08:39:15.2180609Z %58 = tt.broadcast %57 : tensor<16x1xi1, #mma> -> tensor<16x256xi1, #mma> 2026-02-21T08:39:15.2180795Z %59 = arith.cmpi sge, %51, %cst_3 : tensor<1x256xi64, #mma> 2026-02-21T08:39:15.2180990Z %60 = arith.cmpi slt, %51, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:39:15.2181151Z %61 = arith.andi %59, %60 : tensor<1x256xi1, #mma> 2026-02-21T08:39:15.2181320Z %62 = tt.broadcast %61 : tensor<1x256xi1, #mma> -> tensor<16x256xi1, #mma> 2026-02-21T08:39:15.2181495Z %63 = arith.andi %58, %62 : tensor<16x256xi1, #mma> 2026-02-21T08:39:15.2181655Z tt.store %54, %42, %63 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T08:39:15.2181864Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T08:39:15.2182055Z tt.return 2026-02-21T08:39:15.2182138Z } 2026-02-21T08:39:15.2182225Z } 2026-02-21T08:39:15.2182271Z 2026-02-21T08:39:15.2182304Z {-# 2026-02-21T08:39:15.2182392Z external_resources: { 2026-02-21T08:39:15.2182502Z mlir_reproducer: { 2026-02-21T08:39:15.2183518Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:39:15.2184539Z disable_threading: false, 2026-02-21T08:39:15.2184652Z verify_each: true 2026-02-21T08:39:15.2184744Z } 2026-02-21T08:39:15.2184822Z } 2026-02-21T08:39:15.2184895Z #-} 2026-02-21T08:39:15.2185188Z /tmp/torchinductor_root/t3/ct32ivh7iogp24wx3jtpjtpf3jmhln2btps5madsnxbgqvcxyyhy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:39:15.2185942Z /tmp/torchinductor_root/t3/ct32ivh7iogp24wx3jtpjtpf3jmhln2btps5madsnxbgqvcxyyhy.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:39:15.2186512Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:39:15.2187309Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[2, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T08:39:15.2188063Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:39:15.2188234Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:39:18.1613090Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:39:18.1616000Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}> 2026-02-21T08:39:18.1616822Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T08:39:18.1617533Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T08:39:18.1618156Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:39:18.1618788Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:39:18.1619190Z #smem = #ttg.shared_memory 2026-02-21T08:39:18.1620414Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:39:18.1621466Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:39:18.1622358Z %cst = arith.constant dense<16> : tensor<16x1xi64, #mma> 2026-02-21T08:39:18.1622794Z %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma> 2026-02-21T08:39:18.1623163Z %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma> 2026-02-21T08:39:18.1623532Z %cst_2 = arith.constant dense<7168> : tensor<1x512xi64, #mma> 2026-02-21T08:39:18.1623900Z %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma> 2026-02-21T08:39:18.1624276Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:18.1624665Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, 
#blocked> 2026-02-21T08:39:18.1625057Z %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T08:39:18.1625353Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:39:18.1625603Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<16x512xf32, #mma> 2026-02-21T08:39:18.1625870Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T08:39:18.1626072Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:39:18.1626263Z %c14_i32 = arith.constant 14 : i32 2026-02-21T08:39:18.1626499Z %cst_8 = arith.constant dense<0> : tensor<1x512xi8, #blocked2> 2026-02-21T08:39:18.1626797Z %cst_9 = arith.constant dense<7168> : tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1627092Z %cst_10 = arith.constant dense<0> : tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1627386Z %cst_11 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked> 2026-02-21T08:39:18.1627622Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:39:18.1627822Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:39:18.1628000Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:39:18.1628288Z %cst_12 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:18.1628606Z %0 = tt.get_program_id x : i32 2026-02-21T08:39:18.1628791Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T08:39:18.1628975Z %2 = arith.minsi %1, %c14_i32 : i32 2026-02-21T08:39:18.1629293Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:39:18.1629738Z %4 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:18.1630181Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:39:18.1630895Z %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:39:18.1631312Z %7 = arith.muli %6, %cst_6 : tensor<16x1xi32, 
#blocked1> 2026-02-21T08:39:18.1631619Z %8 = tt.broadcast %7 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:39:18.1631969Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:39:18.1632285Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T08:39:18.1632690Z %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:39:18.1633240Z %12 = arith.extsi %11 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:39:18.1633824Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:39:18.1634501Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:39:18.1645419Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:18.1645670Z %16 = arith.cmpi eq, %15, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:18.1645872Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T08:39:18.1646072Z %18 = arith.cmpi eq, %15, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:39:18.1646265Z %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T08:39:18.1646477Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<16x512x!tt.ptr, #mma> 2026-02-21T08:39:18.1646752Z %21 = arith.extsi %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:39:18.1647079Z %22 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, 
#mma> 2026-02-21T08:39:18.1647311Z %23 = arith.muli %22, %cst_1 : tensor<16x1xi64, #mma> 2026-02-21T08:39:18.1647488Z %24 = tt.broadcast %23 : tensor<16x1xi64, #mma> -> tensor<16x512xi64, #mma> 2026-02-21T08:39:18.1647728Z %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:18.1648032Z %26 = arith.extsi %25 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:18.1648288Z %27 = arith.cmpi sge, %22, %cst_0 : tensor<16x1xi64, #mma> 2026-02-21T08:39:18.1648447Z %28 = arith.cmpi slt, %22, %cst : tensor<16x1xi64, #mma> 2026-02-21T08:39:18.1648603Z %29 = arith.andi %27, %28 : tensor<16x1xi1, #mma> 2026-02-21T08:39:18.1648771Z %30 = tt.broadcast %29 : tensor<16x1xi1, #mma> -> tensor<16x512xi1, #mma> 2026-02-21T08:39:18.1648952Z scf.for %arg3 = %0 to %2 step %c1_i32 : i32 { 2026-02-21T08:39:18.1649091Z %31 = arith.muli %arg3, %c512_i32 : i32 2026-02-21T08:39:18.1649217Z %32 = arith.extsi %31 : i32 to i64 2026-02-21T08:39:18.1649391Z %33 = tt.splat %32 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:39:18.1649614Z %34 = arith.addi %33, %12 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:39:18.1649898Z %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1650163Z %36 = arith.cmpi sge, %35, %cst_10 : tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1650339Z %37 = arith.cmpi slt, %35, %cst_9 : tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1650562Z %38 = arith.andi %36, %37 : tensor<1x512xi1, #blocked2> 2026-02-21T08:39:18.1650798Z %39 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c1_i32 iter_args(%arg5 = %cst_7) -> (tensor<16x512xf32, #mma>) : i32 { 2026-02-21T08:39:18.1651031Z %52 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:39:18.1651198Z %53 = tt.splat %52 : 
i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:18.1651425Z %54 = arith.addi %53, %4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:39:18.1651706Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:39:18.1651982Z %56 = tt.broadcast %55 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:39:18.1652178Z %57 = arith.addi %8, %56 : tensor<16x2xi32, #blocked1> 2026-02-21T08:39:18.1652376Z %58 = tt.addptr %9, %57 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:39:18.1652591Z %59 = tt.load %58 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:39:18.1652865Z %60 = ttg.convert_layout %59 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:18.1653304Z %61 = arith.extf %60 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:18.1653591Z %62 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:39:18.1653716Z %63 = arith.muli %62, %c7168_i64 : i64 2026-02-21T08:39:18.1653861Z %64 = tt.splat %63 : i64 -> tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1654024Z %65 = arith.addi %64, %35 : tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1654224Z %66 = tt.addptr %10, %65 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi64, #blocked2> 2026-02-21T08:39:18.1654444Z %67 = tt.load %66, %38, %cst_8 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T08:39:18.1654704Z %68 = ttg.convert_layout %67 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:18.1654994Z %69 = arith.shli %68, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:18.1655236Z %70 = arith.shrsi %69, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T08:39:18.1655499Z %71 = arith.shrsi %68, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:39:18.1655797Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T08:39:18.1656155Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T08:39:18.1656447Z %74 = tt.broadcast %72 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T08:39:18.1656702Z %75 = arith.select %17, %74, %cst_11 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T08:39:18.1656949Z %76 = tt.broadcast %73 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T08:39:18.1657191Z %77 = arith.select %19, %76, %75 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T08:39:18.1657425Z %78 = tt.reshape %77 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked2> 2026-02-21T08:39:18.1657648Z %79 = arith.sitofp %78 : tensor<2x512xi8, #blocked2> to tensor<2x512xf32, #blocked2> 2026-02-21T08:39:18.1657902Z %80 = ttg.local_alloc %79 : (tensor<2x512xf32, #blocked2>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T08:39:18.1658228Z %81 = ttg.local_load %80 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:39:18.1658751Z %82 = tt.dot %61, %81, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x512xf32, #mma> 2026-02-21T08:39:18.1659111Z scf.yield %82 : tensor<16x512xf32, #mma> 2026-02-21T08:39:18.1659280Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T08:39:18.1659570Z %40 = arith.truncf %39 : tensor<16x512xf32, #mma> to tensor<16x512xbf16, #mma> 
2026-02-21T08:39:18.1659825Z %41 = tt.splat %32 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:18.1660069Z %42 = arith.addi %41, %26 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:39:18.1660412Z %43 = tt.expand_dims %42 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T08:39:18.1660680Z %44 = tt.broadcast %43 : tensor<1x512xi64, #mma> -> tensor<16x512xi64, #mma> 2026-02-21T08:39:18.1660869Z %45 = arith.addi %24, %44 : tensor<16x512xi64, #mma> 2026-02-21T08:39:18.1661055Z %46 = tt.addptr %20, %45 : tensor<16x512x!tt.ptr, #mma>, tensor<16x512xi64, #mma> 2026-02-21T08:39:18.1661308Z %47 = arith.cmpi sge, %43, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T08:39:18.1661472Z %48 = arith.cmpi slt, %43, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T08:39:18.1661635Z %49 = arith.andi %47, %48 : tensor<1x512xi1, #mma> 2026-02-21T08:39:18.1661808Z %50 = tt.broadcast %49 : tensor<1x512xi1, #mma> -> tensor<16x512xi1, #mma> 2026-02-21T08:39:18.1661986Z %51 = arith.andi %30, %50 : tensor<16x512xi1, #mma> 2026-02-21T08:39:18.1662146Z tt.store %46, %40, %51 : tensor<16x512x!tt.ptr, #mma> 2026-02-21T08:39:18.1662352Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32} 2026-02-21T08:39:18.1662545Z tt.return 2026-02-21T08:39:18.1662633Z } 2026-02-21T08:39:18.1662721Z } 2026-02-21T08:39:18.1662767Z 2026-02-21T08:39:18.1662802Z {-# 2026-02-21T08:39:18.1662891Z external_resources: { 2026-02-21T08:39:18.1663000Z mlir_reproducer: { 2026-02-21T08:39:18.1664011Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, 
convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:39:18.1665026Z disable_threading: false, 2026-02-21T08:39:18.1665140Z verify_each: true 2026-02-21T08:39:18.1665236Z } 2026-02-21T08:39:18.1665316Z } 2026-02-21T08:39:18.1665389Z #-} 2026-02-21T08:39:18.1665677Z /tmp/torchinductor_root/43/c43dkbophld72d3ragneufisw3cdtnrk2n357ly3meonnfakpgjg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:39:18.1666433Z /tmp/torchinductor_root/43/c43dkbophld72d3ragneufisw3cdtnrk2n357ly3meonnfakpgjg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:39:18.1666992Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:39:18.1667844Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:39:18.1668557Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:39:18.1668729Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:39:21.5743264Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.3 configs/s 2026-02-21T08:39:21.5755537Z [47s] Adaptive compile timeout: 30s (90% percentile=12.6s, bounds=[30.0s, 30s]) 2026-02-21T08:39:22.1725236Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 919.9 configs/s 2026-02-21T08:39:22.7944532Z [48s] Initial random population of 100, 5 starting points: 2026-02-21T08:39:22.7944762Z error=4 2026-02-21T08:39:22.7944851Z timeout=1 2026-02-21T08:39:22.7944940Z ok=95 2026-02-21T08:39:22.7945027Z min=0.1533 2026-02-21T08:39:22.7945110Z mid=1.2324 2026-02-21T08:39:22.7945198Z max=16.4603 2026-02-21T08:39:22.7945293Z best={'block_sizes': [64, 16, 32], 2026-02-21T08:39:22.7945467Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T08:39:22.7945606Z 'l2_groupings': [32], 2026-02-21T08:39:22.7945715Z 'load_eviction_policies': ['', ''], 2026-02-21T08:39:22.7946171Z 'loop_orders': [[0, 1]], 2026-02-21T08:39:22.7946286Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:39:22.7946397Z 'num_sm_multiplier': 1, 2026-02-21T08:39:22.7946509Z 'num_stages': 4, 2026-02-21T08:39:22.7946602Z 'num_warps': 8, 2026-02-21T08:39:22.7946707Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:39:22.7946917Z 'range_flattens': [None, True], 2026-02-21T08:39:22.7947035Z 'range_multi_buffers': [True, True], 2026-02-21T08:39:22.7947157Z 'range_num_stages': [0, 1], 2026-02-21T08:39:22.7947266Z 'range_unroll_factors': [2, 2], 2026-02-21T08:39:22.7947386Z 'range_warp_specializes': [], 2026-02-21T08:39:22.7947491Z 'waves_per_eu': 3} 2026-02-21T08:39:22.7986893Z [48s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:39:24.0068564Z [49s] Generation 1 starting: 102 neighbors, 5 active search path(s) 2026-02-21T08:40:05.4727834Z [91s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, 
num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:40:05.9819837Z [91s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:40:06.6799846Z [92s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:40:06.6816429Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 105/105 0.4 configs/s 2026-02-21T08:40:07.1877369Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:54:87: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T08:40:07.1879689Z b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None) 2026-02-21T08:40:07.1885487Z ^ 2026-02-21T08:40:07.1887768Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:67:45: note: - use: %348 = "ttg.convert_layout"(<>) : (tensor<256x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>>) -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:40:07.1888997Z 2026-02-21T08:40:07.1889100Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T08:40:07.1889246Z ^ 2026-02-21T08:40:07.1890228Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:66:45: note: - use: %349 = "ttg.convert_layout"(<>) : (tensor<256x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>>) -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:40:07.1891216Z 2026-02-21T08:40:07.1891288Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T08:40:07.1892069Z ^ 2026-02-21T08:40:07.1892271Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T08:40:07.1924311Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T08:40:07.1924912Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T08:40:07.1925486Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:40:07.1926059Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:40:07.1926633Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [16], order = [0]}> 2026-02-21T08:40:07.1927093Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}> 2026-02-21T08:40:07.1927455Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 16], order = [0, 1]}> 2026-02-21T08:40:07.1927795Z #blocked7 = 
#ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T08:40:07.1928162Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [16, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:40:07.1928569Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [16, 1], order = [0, 1]}> 2026-02-21T08:40:07.1928922Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [16, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:40:07.1929292Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:40:07.1930116Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:40:07.1930714Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:40:07.1931178Z %cst = arith.constant dense<7168> : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.1931393Z %cst_0 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.1931594Z %cst_1 = arith.constant dense<16> : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.1931800Z %cst_2 = arith.constant dense<0> : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.1932000Z %cst_3 = arith.constant dense<7168> : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.1932215Z %cst_4 = arith.constant dense<0> : tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1932434Z %c14336_i32 = arith.constant 14336 : i32 2026-02-21T08:40:07.1932639Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:40:07.1932841Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:40:07.1932982Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:40:07.1949294Z %c304_i32 = 
arith.constant 304 : i32 2026-02-21T08:40:07.1949601Z %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:40:07.1949848Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:40:07.1950088Z %cst_7 = arith.constant dense<4> : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.1950429Z %cst_8 = arith.constant dense<7168> : tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.1950712Z %cst_9 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T08:40:07.1950918Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:40:07.1951190Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.1951418Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:40:07.1951560Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:40:07.1952100Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:40:07.1952242Z %c448_i32 = arith.constant 448 : i32 2026-02-21T08:40:07.1952388Z %0 = tt.get_program_id x : i32 2026-02-21T08:40:07.1952589Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4> 2026-02-21T08:40:07.1952844Z %2 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked4> 2026-02-21T08:40:07.1953084Z %3 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked4> 2026-02-21T08:40:07.1953335Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<16x512x!tt.ptr, #blocked5> 2026-02-21T08:40:07.1953595Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<256x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.1953835Z %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4> 2026-02-21T08:40:07.1954143Z %7 = ttg.convert_layout %6 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.1954673Z %8 = tt.expand_dims %7 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6> 2026-02-21T08:40:07.1955116Z %9 = ttg.convert_layout %8 : tensor<1x2xi32, #blocked6> -> 
tensor<1x2xi32, #blocked7> 2026-02-21T08:40:07.1955554Z %10 = ttg.convert_layout %9 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T08:40:07.1955976Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x2x1xi32, #blocked8> 2026-02-21T08:40:07.1956340Z %12 = ttg.convert_layout %11 : tensor<1x2x1xi32, #blocked8> -> tensor<1x2x1xi32, #blocked3> 2026-02-21T08:40:07.1956590Z %13 = arith.cmpi eq, %12, %cst_6 : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:40:07.1957135Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked3> -> tensor<256x2x16xi1, #blocked3> 2026-02-21T08:40:07.1957418Z %15 = ttg.convert_layout %14 : tensor<256x2x16xi1, #blocked3> -> tensor<256x2x16xi1, #blocked2> 2026-02-21T08:40:07.1957680Z %16 = arith.cmpi eq, %12, %cst_5 : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:40:07.1957917Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked3> -> tensor<256x2x16xi1, #blocked3> 2026-02-21T08:40:07.1958192Z %18 = ttg.convert_layout %17 : tensor<256x2x16xi1, #blocked3> -> tensor<256x2x16xi1, #blocked2> 2026-02-21T08:40:07.1958476Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<16x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.1958763Z %20 = arith.extsi %1 : tensor<16xi32, #blocked4> to tensor<16xi64, #blocked4> 2026-02-21T08:40:07.1959041Z %21 = arith.subi %c448_i32, %0 : i32 2026-02-21T08:40:07.1959207Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T08:40:07.1959361Z %22 = arith.subi %c304_i32, %c1_i32_11 : i32 2026-02-21T08:40:07.1959535Z %23 = arith.addi %21, %22 : i32 2026-02-21T08:40:07.1959709Z %24 = arith.divui %23, %c304_i32 : i32 2026-02-21T08:40:07.1959858Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T08:40:07.1960003Z %25 = arith.remsi %24, %c2_i32_12 : i32 2026-02-21T08:40:07.1960145Z %26 = arith.subi %24, %25 : i32 2026-02-21T08:40:07.1960276Z %27 = arith.muli %26, %c304_i32 : i32 2026-02-21T08:40:07.1960465Z %28 = arith.addi %0, %27 : i32 
2026-02-21T08:40:07.1960665Z %29 = arith.muli %c304_i32, %c2_i32_12 : i32 2026-02-21T08:40:07.1960846Z scf.for %arg3 = %0 to %28 step %29 : i32 { 2026-02-21T08:40:07.1961007Z %30 = arith.divsi %arg3, %c14336_i32 : i32 2026-02-21T08:40:07.1961153Z %31 = arith.muli %30, %c32_i32 : i32 2026-02-21T08:40:07.1961297Z %32 = arith.subi %c1_i32, %31 : i32 2026-02-21T08:40:07.1961434Z %33 = arith.minsi %32, %c32_i32 : i32 2026-02-21T08:40:07.1961583Z %34 = arith.remsi %arg3, %c14336_i32 : i32 2026-02-21T08:40:07.1961726Z %35 = arith.remsi %34, %33 : i32 2026-02-21T08:40:07.1961871Z %36 = arith.addi %31, %35 : i32 2026-02-21T08:40:07.1962010Z %37 = arith.divsi %34, %33 : i32 2026-02-21T08:40:07.1962295Z %38 = arith.muli %36, %c16_i32 : i32 2026-02-21T08:40:07.1962462Z %39 = tt.splat %38 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:40:07.1962773Z %40 = arith.addi %39, %1 : tensor<16xi32, #blocked4> 2026-02-21T08:40:07.1962939Z %41 = arith.muli %37, %c16_i32 : i32 2026-02-21T08:40:07.1963095Z %42 = tt.splat %41 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:40:07.1963274Z %43 = arith.addi %42, %1 : tensor<16xi32, #blocked4> 2026-02-21T08:40:07.1963544Z %44 = ttg.convert_layout %40 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.1963951Z %45 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi32, #blocked9> 2026-02-21T08:40:07.1964401Z %46 = ttg.convert_layout %45 : tensor<16x1xi32, #blocked9> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:40:07.1964644Z %47 = arith.muli %46, %cst_9 : tensor<16x1xi32, #blocked1> 2026-02-21T08:40:07.1964884Z %48 = tt.broadcast %47 : tensor<16x1xi32, #blocked1> -> tensor<16x512xi32, #blocked1> 2026-02-21T08:40:07.1965245Z %49 = ttg.convert_layout %48 : tensor<16x512xi32, #blocked1> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.1965791Z %50 = ttg.convert_layout %43 : tensor<16xi32, #blocked4> -> tensor<16xi32, 
#ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.1966375Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi32, #blocked6> 2026-02-21T08:40:07.1966836Z %52 = ttg.convert_layout %51 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #blocked> 2026-02-21T08:40:07.1967234Z %53 = tt.broadcast %52 : tensor<1x16xi32, #blocked> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.1967462Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:40:07.1967947Z %54 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg5 = %cst_10) -> (tensor<16x16xf32, #blocked>) : i32 { 2026-02-21T08:40:07.1968377Z %140 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T08:40:07.1968682Z %141 = arith.addi %140, %2 : tensor<256xi32, #blocked4> 2026-02-21T08:40:07.1968947Z %142 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:40:07.1969212Z %143 = tt.splat %142 : i32 -> tensor<512xi32, #blocked4> 2026-02-21T08:40:07.1969480Z %144 = arith.addi %143, %3 : tensor<512xi32, #blocked4> 2026-02-21T08:40:07.1969808Z %145 = ttg.convert_layout %144 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.1970209Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T08:40:07.1970573Z %147 = ttg.convert_layout %146 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5> 2026-02-21T08:40:07.1970868Z %148 = tt.broadcast %147 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.1971116Z %149 = arith.addi %49, %148 : tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.1971377Z %150 = tt.addptr %4, %149 : tensor<16x512x!tt.ptr, #blocked5>, tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.1971635Z %151 = tt.load %150 : tensor<16x512x!tt.ptr, #blocked5> 2026-02-21T08:40:07.1971870Z %152 = arith.extf %151 : tensor<16x512xbf16, 
#blocked5> to tensor<16x512xf32, #blocked5> 2026-02-21T08:40:07.1972209Z %153 = ttg.convert_layout %141 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.1972622Z %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9> 2026-02-21T08:40:07.1972973Z %155 = ttg.convert_layout %154 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.1973227Z %156 = arith.muli %155, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.1973560Z %157 = tt.broadcast %156 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1> 2026-02-21T08:40:07.1973847Z %158 = ttg.convert_layout %157 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.1974092Z %159 = arith.addi %158, %53 : tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.1974332Z %160 = tt.addptr %5, %159 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.1974576Z %161 = tt.load %160 : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.1974772Z %162 = arith.shli %161, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.1974973Z %163 = arith.shrsi %162, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.1975169Z %164 = arith.shrsi %161, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.1975470Z %165 = ttg.convert_layout %163 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.1975893Z %166 = tt.expand_dims %165 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.1976269Z %167 = ttg.convert_layout %166 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.1976631Z %168 = ttg.convert_layout %164 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 
2026-02-21T08:40:07.1977045Z %169 = tt.expand_dims %168 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.1977567Z %170 = ttg.convert_layout %169 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.1977942Z %171 = tt.broadcast %167 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.1978242Z %172 = ttg.convert_layout %171 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1978668Z %173 = arith.select %15, %172, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1979051Z %174 = tt.broadcast %170 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.1979422Z %175 = ttg.convert_layout %174 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1979737Z %176 = arith.select %18, %175, %173 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1980031Z %177 = tt.reshape %176 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked> 2026-02-21T08:40:07.1980309Z %178 = arith.sitofp %177 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked> 2026-02-21T08:40:07.1980668Z %179 = ttg.convert_layout %152 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:40:07.1981083Z %180 = ttg.convert_layout %178 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T08:40:07.1981435Z %181 = ttg.convert_layout %arg5 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.1981912Z %182 = tt.dot %179, %180, %181, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked> 
2026-02-21T08:40:07.1982366Z %c1_i32_15 = arith.constant 1 : i32 2026-02-21T08:40:07.1982528Z %183 = arith.muli %c256_i32, %c1_i32_15 : i32 2026-02-21T08:40:07.1982678Z %184 = arith.addi %arg4, %183 : i32 2026-02-21T08:40:07.1982844Z %185 = tt.splat %184 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T08:40:07.1983030Z %186 = arith.addi %185, %2 : tensor<256xi32, #blocked4> 2026-02-21T08:40:07.1983202Z %187 = arith.muli %184, %c2_i32 : i32 2026-02-21T08:40:07.1983434Z %188 = tt.splat %187 : i32 -> tensor<512xi32, #blocked4> 2026-02-21T08:40:07.1983622Z %189 = arith.addi %188, %3 : tensor<512xi32, #blocked4> 2026-02-21T08:40:07.1983909Z %190 = ttg.convert_layout %189 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.1984303Z %191 = tt.expand_dims %190 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T08:40:07.1984657Z %192 = ttg.convert_layout %191 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5> 2026-02-21T08:40:07.1984990Z %193 = tt.broadcast %192 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.1985291Z %194 = arith.addi %49, %193 : tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.1985548Z %195 = tt.addptr %4, %194 : tensor<16x512x!tt.ptr, #blocked5>, tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.1985798Z %196 = tt.load %195 : tensor<16x512x!tt.ptr, #blocked5> 2026-02-21T08:40:07.1986038Z %197 = arith.extf %196 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5> 2026-02-21T08:40:07.1986373Z %198 = ttg.convert_layout %186 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.1986776Z %199 = tt.expand_dims %198 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9> 2026-02-21T08:40:07.1987136Z %200 = ttg.convert_layout %199 : tensor<256x1xi32, #blocked9> -> 
tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.1987388Z %201 = arith.muli %200, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.1987674Z %202 = tt.broadcast %201 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1> 2026-02-21T08:40:07.1987958Z %203 = ttg.convert_layout %202 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.1988208Z %204 = arith.addi %203, %53 : tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.1988499Z %205 = tt.addptr %5, %204 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.1988763Z %206 = tt.load %205 : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.1988971Z %207 = arith.shli %206, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.1989301Z %208 = arith.shrsi %207, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.1989668Z %209 = arith.shrsi %206, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.1990155Z %210 = ttg.convert_layout %208 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.1990627Z %211 = tt.expand_dims %210 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.1991043Z %212 = ttg.convert_layout %211 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.1991447Z %213 = ttg.convert_layout %209 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.1991873Z %214 = tt.expand_dims %213 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.1992253Z %215 = ttg.convert_layout %214 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.1992561Z %216 = tt.broadcast %212 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.1992970Z %217 = 
ttg.convert_layout %216 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1993290Z %218 = arith.select %15, %217, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1993724Z %219 = tt.broadcast %215 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.1994051Z %220 = ttg.convert_layout %219 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1994534Z %221 = arith.select %18, %220, %218 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.1995010Z %222 = tt.reshape %221 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked> 2026-02-21T08:40:07.1995440Z %223 = arith.sitofp %222 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked> 2026-02-21T08:40:07.1995911Z %224 = ttg.convert_layout %197 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:40:07.1996330Z %225 = ttg.convert_layout %223 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T08:40:07.1996921Z %226 = ttg.convert_layout %182 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.1997582Z %227 = tt.dot %224, %225, %226, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.1998254Z scf.yield %227 : tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.1998471Z } {tt.flatten} 2026-02-21T08:40:07.1998660Z %55 = arith.truncf %54 : tensor<16x16xf32, #blocked> to tensor<16x16xbf16, #blocked> 2026-02-21T08:40:07.1998874Z %56 = arith.extsi %38 : i32 to i64 2026-02-21T08:40:07.1999044Z %57 = arith.extsi %41 : i32 to i64 2026-02-21T08:40:07.1999207Z %58 = tt.splat %56 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:40:07.1999394Z %59 = 
arith.addi %58, %20 : tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2035106Z %60 = ttg.convert_layout %59 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2035752Z %61 = tt.expand_dims %60 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi64, #blocked9> 2026-02-21T08:40:07.2036097Z %62 = ttg.convert_layout %61 : tensor<16x1xi64, #blocked9> -> tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2036337Z %63 = arith.muli %62, %cst_3 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2036568Z %64 = tt.broadcast %63 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1> 2026-02-21T08:40:07.2036843Z %65 = ttg.convert_layout %64 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2037086Z %66 = tt.splat %57 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2037322Z %67 = arith.addi %66, %20 : tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2037599Z %68 = ttg.convert_layout %67 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2037984Z %69 = tt.expand_dims %68 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi64, #blocked6> 2026-02-21T08:40:07.2038316Z %70 = ttg.convert_layout %69 : tensor<1x16xi64, #blocked6> -> tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2038587Z %71 = tt.broadcast %70 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2038815Z %72 = arith.addi %65, %71 : tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2039047Z %73 = tt.addptr %19, %72 : tensor<16x16x!tt.ptr, #blocked>, tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2039298Z %74 = arith.cmpi sge, %62, %cst_2 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2039498Z %75 = arith.cmpi slt, %62, %cst_1 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2039730Z %76 = arith.andi %74, %75 : tensor<16x1xi1, #blocked1> 
2026-02-21T08:40:07.2040075Z %77 = tt.broadcast %76 : tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1> 2026-02-21T08:40:07.2040535Z %78 = ttg.convert_layout %77 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2040777Z %79 = arith.cmpi sge, %70, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2040973Z %80 = arith.cmpi slt, %70, %cst : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2041236Z %81 = arith.andi %79, %80 : tensor<1x16xi1, #blocked> 2026-02-21T08:40:07.2041509Z %82 = tt.broadcast %81 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2041769Z %83 = arith.andi %78, %82 : tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2042031Z tt.store %73, %55, %83 : tensor<16x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.2042208Z %c1_i32_13 = arith.constant 1 : i32 2026-02-21T08:40:07.2042367Z %84 = arith.muli %c304_i32, %c1_i32_13 : i32 2026-02-21T08:40:07.2042550Z %85 = arith.addi %arg3, %84 : i32 2026-02-21T08:40:07.2042825Z %86 = arith.divsi %85, %c14336_i32 : i32 2026-02-21T08:40:07.2042994Z %87 = arith.muli %86, %c32_i32 : i32 2026-02-21T08:40:07.2043180Z %88 = arith.subi %c1_i32, %87 : i32 2026-02-21T08:40:07.2043323Z %89 = arith.minsi %88, %c32_i32 : i32 2026-02-21T08:40:07.2043472Z %90 = arith.remsi %85, %c14336_i32 : i32 2026-02-21T08:40:07.2043615Z %91 = arith.remsi %90, %89 : i32 2026-02-21T08:40:07.2043749Z %92 = arith.addi %87, %91 : i32 2026-02-21T08:40:07.2043888Z %93 = arith.divsi %90, %89 : i32 2026-02-21T08:40:07.2044022Z %94 = arith.muli %92, %c16_i32 : i32 2026-02-21T08:40:07.2044182Z %95 = tt.splat %94 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2044358Z %96 = arith.addi %95, %1 : tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2044519Z %97 = arith.muli %93, %c16_i32 : i32 2026-02-21T08:40:07.2044676Z %98 = tt.splat %97 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2044913Z %99 = arith.addi %98, %1 : tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2045201Z %100 = 
ttg.convert_layout %96 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2045592Z %101 = tt.expand_dims %100 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi32, #blocked9> 2026-02-21T08:40:07.2045941Z %102 = ttg.convert_layout %101 : tensor<16x1xi32, #blocked9> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:40:07.2046186Z %103 = arith.muli %102, %cst_9 : tensor<16x1xi32, #blocked1> 2026-02-21T08:40:07.2046424Z %104 = tt.broadcast %103 : tensor<16x1xi32, #blocked1> -> tensor<16x512xi32, #blocked1> 2026-02-21T08:40:07.2046713Z %105 = ttg.convert_layout %104 : tensor<16x512xi32, #blocked1> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2047050Z %106 = ttg.convert_layout %99 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2047439Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi32, #blocked6> 2026-02-21T08:40:07.2047788Z %108 = ttg.convert_layout %107 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #blocked> 2026-02-21T08:40:07.2048061Z %109 = tt.broadcast %108 : tensor<1x16xi32, #blocked> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2048278Z %c512_i32_14 = arith.constant 512 : i32 2026-02-21T08:40:07.2048546Z %110 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32_14 iter_args(%arg5 = %cst_10) -> (tensor<16x16xf32, #blocked>) : i32 { 2026-02-21T08:40:07.2048843Z %140 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T08:40:07.2049073Z %141 = arith.addi %140, %2 : tensor<256xi32, #blocked4> 2026-02-21T08:40:07.2049263Z %142 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:40:07.2049430Z %143 = tt.splat %142 : i32 -> tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2049616Z %144 = arith.addi %143, %3 : tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2049902Z %145 = ttg.convert_layout %144 : tensor<512xi32, 
#blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2050367Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T08:40:07.2050928Z %147 = ttg.convert_layout %146 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5> 2026-02-21T08:40:07.2051276Z %148 = tt.broadcast %147 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2051518Z %149 = arith.addi %105, %148 : tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2051773Z %150 = tt.addptr %4, %149 : tensor<16x512x!tt.ptr, #blocked5>, tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2052162Z %151 = tt.load %150 : tensor<16x512x!tt.ptr, #blocked5> 2026-02-21T08:40:07.2052553Z %152 = arith.extf %151 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5> 2026-02-21T08:40:07.2053042Z %153 = ttg.convert_layout %141 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2053450Z %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9> 2026-02-21T08:40:07.2053888Z %155 = ttg.convert_layout %154 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2054142Z %156 = arith.muli %155, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2054425Z %157 = tt.broadcast %156 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1> 2026-02-21T08:40:07.2054917Z %158 = ttg.convert_layout %157 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2055171Z %159 = arith.addi %158, %109 : tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2055496Z %160 = tt.addptr %5, %159 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2055744Z %161 = tt.load %160 : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.2055997Z %162 = 
arith.shli %161, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2056193Z %163 = arith.shrsi %162, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2056390Z %164 = arith.shrsi %161, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2056826Z %165 = ttg.convert_layout %163 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2057453Z %166 = tt.expand_dims %165 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2057832Z %167 = ttg.convert_layout %166 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2058325Z %168 = ttg.convert_layout %164 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2058964Z %169 = tt.expand_dims %168 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2059342Z %170 = ttg.convert_layout %169 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2059726Z %171 = tt.broadcast %167 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.2060041Z %172 = ttg.convert_layout %171 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2060399Z %173 = arith.select %15, %172, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2060765Z %174 = tt.broadcast %170 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.2061083Z %175 = ttg.convert_layout %174 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2061537Z %176 = arith.select %18, %175, %173 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2061835Z %177 = tt.reshape %176 : tensor<256x2x16xi8, #blocked2> -> 
tensor<512x16xi8, #blocked> 2026-02-21T08:40:07.2062189Z %178 = arith.sitofp %177 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked> 2026-02-21T08:40:07.2062540Z %179 = ttg.convert_layout %152 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:40:07.2062998Z %180 = ttg.convert_layout %178 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T08:40:07.2063385Z %181 = ttg.convert_layout %arg5 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2063889Z %182 = tt.dot %179, %180, %181, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2064316Z %c1_i32_15 = arith.constant 1 : i32 2026-02-21T08:40:07.2064496Z %183 = arith.muli %c256_i32, %c1_i32_15 : i32 2026-02-21T08:40:07.2064666Z %184 = arith.addi %arg4, %183 : i32 2026-02-21T08:40:07.2064832Z %185 = tt.splat %184 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T08:40:07.2065049Z %186 = arith.addi %185, %2 : tensor<256xi32, #blocked4> 2026-02-21T08:40:07.2065240Z %187 = arith.muli %184, %c2_i32 : i32 2026-02-21T08:40:07.2065404Z %188 = tt.splat %187 : i32 -> tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2065594Z %189 = arith.addi %188, %3 : tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2065961Z %190 = ttg.convert_layout %189 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2066446Z %191 = tt.expand_dims %190 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T08:40:07.2066866Z %192 = ttg.convert_layout %191 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5> 2026-02-21T08:40:07.2067151Z %193 = tt.broadcast %192 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, 
#blocked5> 2026-02-21T08:40:07.2067463Z %194 = arith.addi %105, %193 : tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2067867Z %195 = tt.addptr %4, %194 : tensor<16x512x!tt.ptr, #blocked5>, tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2068148Z %196 = tt.load %195 : tensor<16x512x!tt.ptr, #blocked5> 2026-02-21T08:40:07.2068386Z %197 = arith.extf %196 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5> 2026-02-21T08:40:07.2068804Z %198 = ttg.convert_layout %186 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2069203Z %199 = tt.expand_dims %198 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9> 2026-02-21T08:40:07.2069553Z %200 = ttg.convert_layout %199 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2069804Z %201 = arith.muli %200, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2070043Z %202 = tt.broadcast %201 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1> 2026-02-21T08:40:07.2070327Z %203 = ttg.convert_layout %202 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2070622Z %204 = arith.addi %203, %109 : tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2070894Z %205 = tt.addptr %5, %204 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2071140Z %206 = tt.load %205 : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.2071386Z %207 = arith.shli %206, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2071640Z %208 = arith.shrsi %207, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2071921Z %209 = arith.shrsi %206, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2072216Z %210 = ttg.convert_layout %208 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2072780Z %211 = tt.expand_dims %210 {axis = 1 : i32} : 
tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2073151Z %212 = ttg.convert_layout %211 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2073514Z %213 = ttg.convert_layout %209 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2073931Z %214 = tt.expand_dims %213 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2074385Z %215 = ttg.convert_layout %214 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2074894Z %216 = tt.broadcast %212 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.2075413Z %217 = ttg.convert_layout %216 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2075914Z %218 = arith.select %15, %217, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2076389Z %219 = tt.broadcast %215 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.2076872Z %220 = ttg.convert_layout %219 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2077458Z %221 = arith.select %18, %220, %218 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2077905Z %222 = tt.reshape %221 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked> 2026-02-21T08:40:07.2078310Z %223 = arith.sitofp %222 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked> 2026-02-21T08:40:07.2078854Z %224 = ttg.convert_layout %197 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:40:07.2079524Z %225 = ttg.convert_layout %223 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 
2026-02-21T08:40:07.2080106Z %226 = ttg.convert_layout %182 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2080874Z %227 = tt.dot %224, %225, %226, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2081516Z scf.yield %227 : tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2081746Z } {tt.flatten} 2026-02-21T08:40:07.2082025Z %111 = arith.truncf %110 : tensor<16x16xf32, #blocked> to tensor<16x16xbf16, #blocked> 2026-02-21T08:40:07.2082372Z %112 = arith.extsi %94 : i32 to i64 2026-02-21T08:40:07.2082640Z %113 = arith.extsi %97 : i32 to i64 2026-02-21T08:40:07.2082906Z %114 = tt.splat %112 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2083166Z %115 = arith.addi %114, %20 : tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2083453Z %116 = ttg.convert_layout %115 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2083850Z %117 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi64, #blocked9> 2026-02-21T08:40:07.2084266Z %118 = ttg.convert_layout %117 : tensor<16x1xi64, #blocked9> -> tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2084575Z %119 = arith.muli %118, %cst_3 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2084811Z %120 = tt.broadcast %119 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1> 2026-02-21T08:40:07.2085143Z %121 = ttg.convert_layout %120 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2085485Z %122 = tt.splat %113 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2085715Z %123 = arith.addi %122, %20 : tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2085998Z %124 = ttg.convert_layout %123 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 
2026-02-21T08:40:07.2086393Z %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi64, #blocked6> 2026-02-21T08:40:07.2086834Z %126 = ttg.convert_layout %125 : tensor<1x16xi64, #blocked6> -> tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2087250Z %127 = tt.broadcast %126 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2087626Z %128 = arith.addi %121, %127 : tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2088023Z %129 = tt.addptr %19, %128 : tensor<16x16x!tt.ptr, #blocked>, tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2088410Z %130 = arith.cmpi sge, %118, %cst_2 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2088738Z %131 = arith.cmpi slt, %118, %cst_1 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2089042Z %132 = arith.andi %130, %131 : tensor<16x1xi1, #blocked1> 2026-02-21T08:40:07.2089380Z %133 = tt.broadcast %132 : tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1> 2026-02-21T08:40:07.2089817Z %134 = ttg.convert_layout %133 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2090195Z %135 = arith.cmpi sge, %126, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2090505Z %136 = arith.cmpi slt, %126, %cst : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2090867Z %137 = arith.andi %135, %136 : tensor<1x16xi1, #blocked> 2026-02-21T08:40:07.2091204Z %138 = tt.broadcast %137 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2091541Z %139 = arith.andi %134, %138 : tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2091839Z tt.store %129, %111, %139 : tensor<16x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.2092096Z } 2026-02-21T08:40:07.2092285Z scf.for %arg3 = %28 to %c448_i32 step %c304_i32 : i32 { 2026-02-21T08:40:07.2092540Z %30 = arith.divsi %arg3, %c14336_i32 : i32 2026-02-21T08:40:07.2092771Z %31 = arith.muli %30, %c32_i32 : i32 2026-02-21T08:40:07.2092984Z %32 = arith.subi %c1_i32, %31 : i32 
2026-02-21T08:40:07.2093192Z %33 = arith.minsi %32, %c32_i32 : i32 2026-02-21T08:40:07.2093403Z %34 = arith.remsi %arg3, %c14336_i32 : i32 2026-02-21T08:40:07.2093628Z %35 = arith.remsi %34, %33 : i32 2026-02-21T08:40:07.2093836Z %36 = arith.addi %31, %35 : i32 2026-02-21T08:40:07.2094045Z %37 = arith.divsi %34, %33 : i32 2026-02-21T08:40:07.2094255Z %38 = arith.muli %36, %c16_i32 : i32 2026-02-21T08:40:07.2094504Z %39 = tt.splat %38 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2094765Z %40 = arith.addi %39, %1 : tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2094997Z %41 = arith.muli %37, %c16_i32 : i32 2026-02-21T08:40:07.2095241Z %42 = tt.splat %41 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2095491Z %43 = arith.addi %42, %1 : tensor<16xi32, #blocked4> 2026-02-21T08:40:07.2095937Z %44 = ttg.convert_layout %40 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2096549Z %45 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi32, #blocked9> 2026-02-21T08:40:07.2097068Z %46 = ttg.convert_layout %45 : tensor<16x1xi32, #blocked9> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:40:07.2097436Z %47 = arith.muli %46, %cst_9 : tensor<16x1xi32, #blocked1> 2026-02-21T08:40:07.2097775Z %48 = tt.broadcast %47 : tensor<16x1xi32, #blocked1> -> tensor<16x512xi32, #blocked1> 2026-02-21T08:40:07.2098269Z %49 = ttg.convert_layout %48 : tensor<16x512xi32, #blocked1> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2098788Z %50 = ttg.convert_layout %43 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2099393Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi32, #blocked6> 2026-02-21T08:40:07.2099940Z %52 = ttg.convert_layout %51 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #blocked> 2026-02-21T08:40:07.2100301Z %53 
= tt.broadcast %52 : tensor<1x16xi32, #blocked> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2100517Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:40:07.2100864Z %54 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg5 = %cst_10) -> (tensor<16x16xf32, #blocked>) : i32 { 2026-02-21T08:40:07.2101164Z %84 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T08:40:07.2101357Z %85 = arith.addi %84, %2 : tensor<256xi32, #blocked4> 2026-02-21T08:40:07.2101525Z %86 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:40:07.2101693Z %87 = tt.splat %86 : i32 -> tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2101873Z %88 = arith.addi %87, %3 : tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2102160Z %89 = ttg.convert_layout %88 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2102553Z %90 = tt.expand_dims %89 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T08:40:07.2102899Z %91 = ttg.convert_layout %90 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5> 2026-02-21T08:40:07.2103248Z %92 = tt.broadcast %91 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2103484Z %93 = arith.addi %49, %92 : tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2103733Z %94 = tt.addptr %4, %93 : tensor<16x512x!tt.ptr, #blocked5>, tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2103982Z %95 = tt.load %94 : tensor<16x512x!tt.ptr, #blocked5> 2026-02-21T08:40:07.2104211Z %96 = arith.extf %95 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5> 2026-02-21T08:40:07.2104542Z %97 = ttg.convert_layout %85 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2104936Z %98 = tt.expand_dims %97 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9> 
2026-02-21T08:40:07.2105284Z %99 = ttg.convert_layout %98 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2105538Z %100 = arith.muli %99, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2105775Z %101 = tt.broadcast %100 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1> 2026-02-21T08:40:07.2106075Z %102 = ttg.convert_layout %101 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2106322Z %103 = arith.addi %102, %53 : tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2106565Z %104 = tt.addptr %5, %103 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2106814Z %105 = tt.load %104 : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.2107010Z %106 = arith.shli %105, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2107212Z %107 = arith.shrsi %106, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2107415Z %108 = arith.shrsi %105, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2107721Z %109 = ttg.convert_layout %107 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2108143Z %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2108561Z %111 = ttg.convert_layout %110 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2108922Z %112 = ttg.convert_layout %108 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2109333Z %113 = tt.expand_dims %112 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2109713Z %114 = ttg.convert_layout %113 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2110020Z %115 = tt.broadcast %111 : 
tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.2110323Z %116 = ttg.convert_layout %115 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2110640Z %117 = arith.select %15, %116, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2110944Z %118 = tt.broadcast %114 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.2111254Z %119 = ttg.convert_layout %118 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2111564Z %120 = arith.select %18, %119, %117 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2111862Z %121 = tt.reshape %120 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked> 2026-02-21T08:40:07.2112136Z %122 = arith.sitofp %121 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked> 2026-02-21T08:40:07.2112520Z %123 = ttg.convert_layout %96 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:40:07.2112932Z %124 = ttg.convert_layout %122 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T08:40:07.2113291Z %125 = ttg.convert_layout %arg5 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2113763Z %126 = tt.dot %123, %124, %125, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2114171Z %c1_i32_13 = arith.constant 1 : i32 2026-02-21T08:40:07.2114335Z %127 = arith.muli %c256_i32, %c1_i32_13 : i32 2026-02-21T08:40:07.2114488Z %128 = arith.addi %arg4, %127 : i32 2026-02-21T08:40:07.2114656Z %129 = tt.splat %128 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T08:40:07.2114844Z %130 = arith.addi %129, %2 : tensor<256xi32, #blocked4> 
2026-02-21T08:40:07.2115018Z %131 = arith.muli %128, %c2_i32 : i32 2026-02-21T08:40:07.2115184Z %132 = tt.splat %131 : i32 -> tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2115372Z %133 = arith.addi %132, %3 : tensor<512xi32, #blocked4> 2026-02-21T08:40:07.2115659Z %134 = ttg.convert_layout %133 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2116052Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T08:40:07.2116407Z %136 = ttg.convert_layout %135 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5> 2026-02-21T08:40:07.2116697Z %137 = tt.broadcast %136 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2116938Z %138 = arith.addi %49, %137 : tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2117190Z %139 = tt.addptr %4, %138 : tensor<16x512x!tt.ptr, #blocked5>, tensor<16x512xi32, #blocked5> 2026-02-21T08:40:07.2117438Z %140 = tt.load %139 : tensor<16x512x!tt.ptr, #blocked5> 2026-02-21T08:40:07.2117729Z %141 = arith.extf %140 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5> 2026-02-21T08:40:07.2118057Z %142 = ttg.convert_layout %130 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2118453Z %143 = tt.expand_dims %142 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9> 2026-02-21T08:40:07.2118802Z %144 = ttg.convert_layout %143 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2119050Z %145 = arith.muli %144, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T08:40:07.2119286Z %146 = tt.broadcast %145 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1> 2026-02-21T08:40:07.2119570Z %147 = ttg.convert_layout %146 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked> 
2026-02-21T08:40:07.2119818Z %148 = arith.addi %147, %53 : tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2120057Z %149 = tt.addptr %5, %148 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi32, #blocked> 2026-02-21T08:40:07.2120296Z %150 = tt.load %149 : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.2120493Z %151 = arith.shli %150, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2120691Z %152 = arith.shrsi %151, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2120888Z %153 = arith.shrsi %150, %cst_7 : tensor<256x16xi8, #blocked> 2026-02-21T08:40:07.2121217Z %154 = ttg.convert_layout %152 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2121669Z %155 = tt.expand_dims %154 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2122045Z %156 = ttg.convert_layout %155 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2122411Z %157 = ttg.convert_layout %153 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:40:07.2122923Z %158 = tt.expand_dims %157 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10> 2026-02-21T08:40:07.2123296Z %159 = ttg.convert_layout %158 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11> 2026-02-21T08:40:07.2123604Z %160 = tt.broadcast %156 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 2026-02-21T08:40:07.2123912Z %161 = ttg.convert_layout %160 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2124231Z %162 = arith.select %15, %161, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2124540Z %163 = tt.broadcast %159 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11> 
2026-02-21T08:40:07.2124847Z %164 = ttg.convert_layout %163 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2125211Z %165 = arith.select %18, %164, %162 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2> 2026-02-21T08:40:07.2125509Z %166 = tt.reshape %165 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked> 2026-02-21T08:40:07.2125784Z %167 = arith.sitofp %166 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked> 2026-02-21T08:40:07.2126136Z %168 = ttg.convert_layout %141 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:40:07.2126553Z %169 = ttg.convert_layout %167 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T08:40:07.2126907Z %170 = ttg.convert_layout %126 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2127435Z %171 = tt.dot %168, %169, %170, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2127861Z scf.yield %171 : tensor<16x16xf32, #blocked> 2026-02-21T08:40:07.2128099Z } {tt.flatten} 2026-02-21T08:40:07.2128302Z %55 = arith.truncf %54 : tensor<16x16xf32, #blocked> to tensor<16x16xbf16, #blocked> 2026-02-21T08:40:07.2128559Z %56 = arith.extsi %38 : i32 to i64 2026-02-21T08:40:07.2128731Z %57 = arith.extsi %41 : i32 to i64 2026-02-21T08:40:07.2128892Z %58 = tt.splat %56 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2129074Z %59 = arith.addi %58, %20 : tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2129362Z %60 = ttg.convert_layout %59 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T08:40:07.2129746Z %61 = tt.expand_dims %60 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi64, 
#blocked9> 2026-02-21T08:40:07.2130086Z %62 = ttg.convert_layout %61 : tensor<16x1xi64, #blocked9> -> tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2130327Z %63 = arith.muli %62, %cst_3 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2130554Z %64 = tt.broadcast %63 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1> 2026-02-21T08:40:07.2130831Z %65 = ttg.convert_layout %64 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2131061Z %66 = tt.splat %57 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2131241Z %67 = arith.addi %66, %20 : tensor<16xi64, #blocked4> 2026-02-21T08:40:07.2131577Z %68 = ttg.convert_layout %67 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T08:40:07.2131958Z %69 = tt.expand_dims %68 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi64, #blocked6> 2026-02-21T08:40:07.2132288Z %70 = ttg.convert_layout %69 : tensor<1x16xi64, #blocked6> -> tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2132557Z %71 = tt.broadcast %70 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2132782Z %72 = arith.addi %65, %71 : tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2133014Z %73 = tt.addptr %19, %72 : tensor<16x16x!tt.ptr, #blocked>, tensor<16x16xi64, #blocked> 2026-02-21T08:40:07.2133263Z %74 = arith.cmpi sge, %62, %cst_2 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2133468Z %75 = arith.cmpi slt, %62, %cst_1 : tensor<16x1xi64, #blocked1> 2026-02-21T08:40:07.2133659Z %76 = arith.andi %74, %75 : tensor<16x1xi1, #blocked1> 2026-02-21T08:40:07.2133879Z %77 = tt.broadcast %76 : tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1> 2026-02-21T08:40:07.2134143Z %78 = ttg.convert_layout %77 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2134395Z %79 = arith.cmpi sge, %70, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2134592Z %80 = arith.cmpi 
slt, %70, %cst : tensor<1x16xi64, #blocked> 2026-02-21T08:40:07.2134783Z %81 = arith.andi %79, %80 : tensor<1x16xi1, #blocked> 2026-02-21T08:40:07.2134990Z %82 = tt.broadcast %81 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2135209Z %83 = arith.andi %78, %82 : tensor<16x16xi1, #blocked> 2026-02-21T08:40:07.2135399Z tt.store %73, %55, %83 : tensor<16x16x!tt.ptr, #blocked> 2026-02-21T08:40:07.2135570Z } {tt.num_stages = 1 : i32} 2026-02-21T08:40:07.2135698Z tt.return 2026-02-21T08:40:07.2135794Z } 2026-02-21T08:40:07.2135893Z } 2026-02-21T08:40:07.2135949Z 2026-02-21T08:40:07.2135986Z {-# 2026-02-21T08:40:07.2136087Z external_resources: { 2026-02-21T08:40:07.2136225Z mlir_reproducer: { 2026-02-21T08:40:07.2138946Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T08:40:07.2141811Z disable_threading: false, 2026-02-21T08:40:07.2141938Z verify_each: true 2026-02-21T08:40:07.2142073Z } 2026-02-21T08:40:07.2142166Z } 2026-02-21T08:40:07.2142255Z #-} 2026-02-21T08:40:07.2142593Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:40:07.2143515Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:40:07.2144224Z [93s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:40:07.2145151Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T08:40:07.2145991Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:40:07.2146194Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:40:13.5446189Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 105/105 15.6 configs/s 2026-02-21T08:40:16.3077922Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 327.6 2026-02-21T08:40:16.3079040Z configs/s 2026-02-21T08:40:16.7388574Z [102s] Generation 1 complete: 2026-02-21T08:40:16.7388742Z error=5 2026-02-21T08:40:16.7388837Z timeout=3 2026-02-21T08:40:16.7388920Z ok=100 2026-02-21T08:40:16.7389002Z min=0.0853 2026-02-21T08:40:16.7389090Z mid=0.2193 2026-02-21T08:40:16.7389169Z max=3.5782 2026-02-21T08:40:16.7389288Z best={'block_sizes': [128, 16, 16], 2026-02-21T08:40:16.7389432Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:40:16.7389572Z 'l2_groupings': [64], 2026-02-21T08:40:16.7389680Z 'load_eviction_policies': ['', ''], 2026-02-21T08:40:16.7389805Z 'loop_orders': [[1, 0]], 2026-02-21T08:40:16.7389911Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:40:16.7390046Z 'num_sm_multiplier': 128, 2026-02-21T08:40:16.7390150Z 'num_stages': 4, 2026-02-21T08:40:16.7390244Z 'num_warps': 2, 2026-02-21T08:40:16.7390368Z 'pid_type': 'persistent_blocked', 2026-02-21T08:40:16.7390487Z 'range_flattens': [False, True], 2026-02-21T08:40:16.7391014Z 'range_multi_buffers': [True, False], 2026-02-21T08:40:16.7391132Z 'range_num_stages': [4, 3], 2026-02-21T08:40:16.7391242Z 'range_unroll_factors': [0, 2], 2026-02-21T08:40:16.7391355Z 'range_warp_specializes': [], 2026-02-21T08:40:16.7391466Z 'waves_per_eu': 1} 2026-02-21T08:40:16.7672278Z [102s] Fitting surrogate: 208 points, 208 targets 2026-02-21T08:40:18.1504711Z [104s] Generation 2 starting: 101 neighbors, 5 active search path(s) 2026-02-21T08:41:00.2619553Z [146s] Timeout after 30s compiling Config(block_sizes=[256, 8, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], 
range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:41:00.2640044Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 0.5 configs/s 2026-02-21T08:41:05.9106972Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 104/104 18.6 configs/s 2026-02-21T08:41:15.1900819Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 113.3 2026-02-21T08:41:15.1901519Z configs/s 2026-02-21T08:41:15.8671313Z [161s] Generation 2 complete: 2026-02-21T08:41:15.8671759Z error=12 2026-02-21T08:41:15.8671974Z timeout=1 2026-02-21T08:41:15.8672168Z ok=94 2026-02-21T08:41:15.8672365Z min=0.0854 2026-02-21T08:41:15.8672561Z mid=0.1110 2026-02-21T08:41:15.8672762Z max=1.0793 2026-02-21T08:41:15.8672989Z best={'block_sizes': [128, 16, 16], 2026-02-21T08:41:15.8673368Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:41:15.8673729Z 'l2_groupings': [64], 2026-02-21T08:41:15.8674539Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:15.8674858Z 'loop_orders': [[1, 0]], 2026-02-21T08:41:15.8675133Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:41:15.8675443Z 'num_sm_multiplier': 128, 2026-02-21T08:41:15.8675708Z 'num_stages': 4, 2026-02-21T08:41:15.8675938Z 'num_warps': 2, 2026-02-21T08:41:15.8676193Z 'pid_type': 'persistent_blocked', 2026-02-21T08:41:15.8676505Z 'range_flattens': [False, True], 2026-02-21T08:41:15.8676821Z 'range_multi_buffers': [True, False], 2026-02-21T08:41:15.8677128Z 'range_num_stages': [4, 3], 2026-02-21T08:41:15.8677411Z 'range_unroll_factors': [0, 2], 2026-02-21T08:41:15.8677700Z 'range_warp_specializes': [], 2026-02-21T08:41:15.8677975Z 'waves_per_eu': 1} 2026-02-21T08:41:15.9890909Z [161s] Fitting surrogate: 315 points, 315 targets 2026-02-21T08:41:17.0526749Z [162s] Generation 3 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:41:39.2361112Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 0.7 
configs/s 2026-02-21T08:41:44.9920536Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 17.5 configs/s 2026-02-21T08:41:51.6694433Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 145.0 2026-02-21T08:41:51.6695051Z configs/s 2026-02-21T08:41:52.2767297Z [198s] Generation 3 complete: 2026-02-21T08:41:52.2767525Z error=4 2026-02-21T08:41:52.2767659Z ok=99 2026-02-21T08:41:52.2767771Z min=0.0855 2026-02-21T08:41:52.2767880Z mid=0.1279 2026-02-21T08:41:52.2767990Z max=2.7258 2026-02-21T08:41:52.2768104Z best={'block_sizes': [128, 16, 16], 2026-02-21T08:41:52.2768304Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:41:52.2768483Z 'l2_groupings': [64], 2026-02-21T08:41:52.2768633Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:52.2768793Z 'loop_orders': [[1, 0]], 2026-02-21T08:41:52.2768938Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:41:52.2769079Z 'num_stages': 3, 2026-02-21T08:41:52.2769195Z 'num_warps': 2, 2026-02-21T08:41:52.2769346Z 'pid_type': 'flat', 2026-02-21T08:41:52.2769488Z 'range_flattens': [None, True], 2026-02-21T08:41:52.2769957Z 'range_multi_buffers': [None, False], 2026-02-21T08:41:52.2770114Z 'range_num_stages': [0, 3], 2026-02-21T08:41:52.2770260Z 'range_unroll_factors': [0, 2], 2026-02-21T08:41:52.2770409Z 'range_warp_specializes': [], 2026-02-21T08:41:52.2770553Z 'waves_per_eu': 1} 2026-02-21T08:41:52.3934193Z [198s] Fitting surrogate: 418 points, 418 targets 2026-02-21T08:41:53.2677160Z [199s] Generation 4 starting: 79 neighbors, 4 active search path(s) 2026-02-21T08:42:30.0103013Z [235s] Timeout after 30s compiling Config(block_sizes=[128, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 2], 
range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:42:30.0122448Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.2 configs/s 2026-02-21T08:42:35.4203669Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 14.7 configs/s 2026-02-21T08:42:42.5181111Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 136.9 2026-02-21T08:42:42.5181723Z configs/s 2026-02-21T08:42:43.1133195Z [249s] Generation 4 complete: 2026-02-21T08:42:43.1133544Z error=1 2026-02-21T08:42:43.1133739Z timeout=1 2026-02-21T08:42:43.1133927Z ok=81 2026-02-21T08:42:43.1134107Z min=0.0773 2026-02-21T08:42:43.1134297Z mid=0.0937 2026-02-21T08:42:43.1134469Z max=2.2467 2026-02-21T08:42:43.1134676Z best={'block_sizes': [256, 16, 16], 2026-02-21T08:42:43.1135017Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:42:43.1135349Z 'l2_groupings': [4], 2026-02-21T08:42:43.1135922Z 'load_eviction_policies': ['', ''], 2026-02-21T08:42:43.1136204Z 'loop_orders': [[1, 0]], 2026-02-21T08:42:43.1136473Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:42:43.1136753Z 'num_sm_multiplier': 8, 2026-02-21T08:42:43.1136991Z 'num_stages': 2, 2026-02-21T08:42:43.1137196Z 'num_warps': 2, 2026-02-21T08:42:43.1137432Z 'pid_type': 'persistent_blocked', 2026-02-21T08:42:43.1137713Z 'range_flattens': [False, True], 2026-02-21T08:42:43.1137991Z 'range_multi_buffers': [None, None], 2026-02-21T08:42:43.1138267Z 'range_num_stages': [0, 4], 2026-02-21T08:42:43.1138521Z 'range_unroll_factors': [2, 2], 2026-02-21T08:42:43.1138789Z 'range_warp_specializes': [], 2026-02-21T08:42:43.1139036Z 'waves_per_eu': 1} 2026-02-21T08:42:43.2237333Z [249s] Fitting surrogate: 501 points, 501 targets 2026-02-21T08:42:43.9873725Z [249s] Generation 5 starting: 71 neighbors, 4 active search path(s) 2026-02-21T08:43:20.6098488Z [286s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], 
loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:43:21.4187354Z [287s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:43:21.4202152Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.5 configs/s 2026-02-21T08:43:25.3356374Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 18.4 configs/s 2026-02-21T08:43:30.2941324Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 193.6 2026-02-21T08:43:30.2942613Z configs/s 2026-02-21T08:43:30.7904353Z [296s] Generation 5 complete: 2026-02-21T08:43:30.7904776Z error=5 2026-02-21T08:43:30.7904988Z timeout=2 2026-02-21T08:43:30.7905187Z ok=68 2026-02-21T08:43:30.7905388Z min=0.0773 2026-02-21T08:43:30.7905592Z mid=0.0975 2026-02-21T08:43:30.7905784Z max=1.3049 2026-02-21T08:43:30.7906008Z best={'block_sizes': [256, 16, 16], 2026-02-21T08:43:30.7906336Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:43:30.7906597Z 'l2_groupings': [4], 2026-02-21T08:43:30.7906798Z 'load_eviction_policies': ['', ''], 2026-02-21T08:43:30.7907052Z 'loop_orders': [[1, 0]], 2026-02-21T08:43:30.7907253Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:43:30.7907451Z 'num_sm_multiplier': 8, 2026-02-21T08:43:30.7907639Z 'num_stages': 2, 2026-02-21T08:43:30.7907841Z 'num_warps': 2, 2026-02-21T08:43:30.7908041Z 'pid_type': 'persistent_blocked', 
2026-02-21T08:43:30.7908262Z 'range_flattens': [False, True], 2026-02-21T08:43:30.7908496Z 'range_multi_buffers': [None, None], 2026-02-21T08:43:30.7908711Z 'range_num_stages': [0, 3], 2026-02-21T08:43:30.7908912Z 'range_unroll_factors': [2, 2], 2026-02-21T08:43:30.7909121Z 'range_warp_specializes': [], 2026-02-21T08:43:30.7909315Z 'waves_per_eu': 1} 2026-02-21T08:43:30.8651440Z [296s] Fitting surrogate: 576 points, 576 targets 2026-02-21T08:43:31.4261978Z [297s] Generation 6 starting: 51 neighbors, 3 active search path(s) 2026-02-21T08:44:07.8130598Z [333s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 3], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:44:07.8152376Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.5 configs/s 2026-02-21T08:44:10.7085870Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 17.9 configs/s 2026-02-21T08:44:14.6560421Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 240.9 2026-02-21T08:44:14.6560797Z configs/s 2026-02-21T08:44:15.1036922Z [341s] Generation 6 complete: 2026-02-21T08:44:15.1048570Z error=3 2026-02-21T08:44:15.1048670Z timeout=1 2026-02-21T08:44:15.1048749Z ok=50 2026-02-21T08:44:15.1048832Z min=0.0756 2026-02-21T08:44:15.1048917Z mid=0.0912 2026-02-21T08:44:15.1049000Z max=0.9757 2026-02-21T08:44:15.1049092Z best={'block_sizes': [256, 16, 16], 2026-02-21T08:44:15.1049245Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:44:15.1049383Z 'l2_groupings': [64], 2026-02-21T08:44:15.1049524Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:15.1049666Z 'loop_orders': [[0, 1]], 2026-02-21T08:44:15.1049784Z 
'matrix_instr_nonkdim': 16, 2026-02-21T08:44:15.1049894Z 'num_sm_multiplier': 128, 2026-02-21T08:44:15.1049996Z 'num_stages': 3, 2026-02-21T08:44:15.1050085Z 'num_warps': 2, 2026-02-21T08:44:15.1050184Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:44:15.1050310Z 'range_flattens': [False, True], 2026-02-21T08:44:15.1050428Z 'range_multi_buffers': [True, False], 2026-02-21T08:44:15.1050542Z 'range_num_stages': [4, 3], 2026-02-21T08:44:15.1050651Z 'range_unroll_factors': [1, 2], 2026-02-21T08:44:15.1050759Z 'range_warp_specializes': [], 2026-02-21T08:44:15.1050864Z 'waves_per_eu': 1} 2026-02-21T08:44:15.1643396Z [341s] Fitting surrogate: 630 points, 630 targets 2026-02-21T08:44:15.6282353Z [341s] Generation 7 starting: 43 neighbors, 2 active search path(s) 2026-02-21T08:44:51.2519870Z [377s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:44:51.2537454Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 0.4 configs/s 2026-02-21T08:44:53.7176107Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 17.8 configs/s 2026-02-21T08:44:56.6464352Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 317.3 2026-02-21T08:44:56.6464979Z configs/s 2026-02-21T08:44:57.0868876Z [383s] Generation 7 complete: 2026-02-21T08:44:57.0870725Z error=2 2026-02-21T08:44:57.0871273Z timeout=1 2026-02-21T08:44:57.0871547Z ok=42 2026-02-21T08:44:57.0871766Z min=0.0750 2026-02-21T08:44:57.0872036Z mid=0.1085 2026-02-21T08:44:57.0872237Z max=0.6665 2026-02-21T08:44:57.0872479Z best={'block_sizes': [256, 16, 16], 
2026-02-21T08:44:57.0872907Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T08:44:57.0873292Z 'l2_groupings': [64], 2026-02-21T08:44:57.0873573Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:57.0873899Z 'loop_orders': [[0, 1]], 2026-02-21T08:44:57.0874178Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:44:57.0874464Z 'num_sm_multiplier': 2, 2026-02-21T08:44:57.0874733Z 'num_stages': 3, 2026-02-21T08:44:57.0874959Z 'num_warps': 2, 2026-02-21T08:44:57.0875223Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:44:57.0875552Z 'range_flattens': [None, True], 2026-02-21T08:44:57.0875859Z 'range_multi_buffers': [True, False], 2026-02-21T08:44:57.0876160Z 'range_num_stages': [0, 2], 2026-02-21T08:44:57.0876439Z 'range_unroll_factors': [2, 2], 2026-02-21T08:44:57.0876729Z 'range_warp_specializes': [], 2026-02-21T08:44:57.0876988Z 'waves_per_eu': 1} 2026-02-21T08:44:57.1273348Z [383s] Fitting surrogate: 675 points, 675 targets 2026-02-21T08:44:57.4194181Z [383s] Generation 8 starting: 21 neighbors, 1 active search path(s) 2026-02-21T08:45:06.8082848Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 1.9 configs/s 2026-02-21T08:45:08.1518577Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.5 configs/s 2026-02-21T08:45:09.9793407Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 490.8 2026-02-21T08:45:09.9793671Z configs/s 2026-02-21T08:45:10.3296931Z [396s] Generation 8 complete: 2026-02-21T08:45:10.3297142Z error=1 2026-02-21T08:45:10.3297224Z ok=21 2026-02-21T08:45:10.3297304Z min=0.0756 2026-02-21T08:45:10.3297396Z mid=0.0866 2026-02-21T08:45:10.3297481Z max=0.2555 2026-02-21T08:45:10.3297569Z best={'block_sizes': [256, 16, 16], 2026-02-21T08:45:10.3297714Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T08:45:10.3297889Z 'l2_groupings': [64], 2026-02-21T08:45:10.3297995Z 'load_eviction_policies': ['', ''], 2026-02-21T08:45:10.3298562Z 'loop_orders': [[0, 1]], 2026-02-21T08:45:10.3298670Z 
'matrix_instr_nonkdim': 16, 2026-02-21T08:45:10.3298774Z 'num_sm_multiplier': 2, 2026-02-21T08:45:10.3298876Z 'num_stages': 3, 2026-02-21T08:45:10.3298963Z 'num_warps': 2, 2026-02-21T08:45:10.3299060Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:45:10.3299184Z 'range_flattens': [None, True], 2026-02-21T08:45:10.3299299Z 'range_multi_buffers': [True, False], 2026-02-21T08:45:10.3299414Z 'range_num_stages': [0, 2], 2026-02-21T08:45:10.3299516Z 'range_unroll_factors': [1, 2], 2026-02-21T08:45:10.3299626Z 'range_warp_specializes': [], 2026-02-21T08:45:10.3299727Z 'waves_per_eu': 1} 2026-02-21T08:45:10.3543433Z [396s] Fitting surrogate: 697 points, 697 targets 2026-02-21T08:45:10.6500432Z [396s] Generation 9 starting: 21 neighbors, 1 active search path(s) 2026-02-21T08:45:25.0135532Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.5 configs/s 2026-02-21T08:45:26.3528045Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.6 configs/s 2026-02-21T08:45:28.4642244Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 624.5 2026-02-21T08:45:28.4642790Z configs/s 2026-02-21T08:45:28.8174244Z [414s] Generation 9 complete: 2026-02-21T08:45:28.8174406Z error=1 2026-02-21T08:45:28.8174503Z ok=21 2026-02-21T08:45:28.8174599Z min=0.0755 2026-02-21T08:45:28.8174716Z mid=0.0810 2026-02-21T08:45:28.8174812Z max=0.5877 2026-02-21T08:45:28.8174917Z best={'block_sizes': [256, 16, 16], 2026-02-21T08:45:28.8175097Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T08:45:28.8175267Z 'l2_groupings': [64], 2026-02-21T08:45:28.8175397Z 'load_eviction_policies': ['', ''], 2026-02-21T08:45:28.8175550Z 'loop_orders': [[0, 1]], 2026-02-21T08:45:28.8175685Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:45:28.8176191Z 'num_sm_multiplier': 2, 2026-02-21T08:45:28.8176312Z 'num_stages': 3, 2026-02-21T08:45:28.8176419Z 'num_warps': 2, 2026-02-21T08:45:28.8176559Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:45:28.8176715Z 
'range_flattens': [None, True], 2026-02-21T08:45:28.8176855Z 'range_multi_buffers': [True, True], 2026-02-21T08:45:28.8176992Z 'range_num_stages': [0, 2], 2026-02-21T08:45:28.8177122Z 'range_unroll_factors': [1, 2], 2026-02-21T08:45:28.8177257Z 'range_warp_specializes': [], 2026-02-21T08:45:28.8177385Z 'waves_per_eu': 1} 2026-02-21T08:45:28.8372929Z [414s] Fitting surrogate: 719 points, 719 targets 2026-02-21T08:45:29.1331236Z [415s] Generation 10 starting: 21 neighbors, 1 active search path(s) 2026-02-21T08:45:43.6403648Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.7 configs/s 2026-02-21T08:45:44.8265475Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 18.5 configs/s 2026-02-21T08:45:45.8218030Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 843.4 2026-02-21T08:45:45.8218601Z configs/s 2026-02-21T08:45:46.2063133Z [432s] Generation 10 complete: 2026-02-21T08:45:46.2063529Z error=2 2026-02-21T08:45:46.2063768Z ok=20 2026-02-21T08:45:46.2063976Z min=0.0756 2026-02-21T08:45:46.2064185Z mid=0.1217 2026-02-21T08:45:46.2064385Z max=0.7168 2026-02-21T08:45:46.2064615Z best={'block_sizes': [256, 16, 16], 2026-02-21T08:45:46.2064985Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T08:45:46.2065377Z 'l2_groupings': [64], 2026-02-21T08:45:46.2065657Z 'load_eviction_policies': ['', ''], 2026-02-21T08:45:46.2065965Z 'loop_orders': [[0, 1]], 2026-02-21T08:45:46.2066252Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:45:46.2066536Z 'num_sm_multiplier': 2, 2026-02-21T08:45:46.2066791Z 'num_stages': 3, 2026-02-21T08:45:46.2067017Z 'num_warps': 2, 2026-02-21T08:45:46.2067274Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:45:46.2067628Z 'range_flattens': [None, True], 2026-02-21T08:45:46.2067923Z 'range_multi_buffers': [True, True], 2026-02-21T08:45:46.2068772Z 'range_num_stages': [0, 2], 2026-02-21T08:45:46.2069046Z 'range_unroll_factors': [1, 2], 2026-02-21T08:45:46.2069339Z 'range_warp_specializes': [], 
2026-02-21T08:45:46.2069614Z 'waves_per_eu': 1} 2026-02-21T08:45:46.2245888Z [432s] Fitting surrogate: 741 points, 741 targets 2026-02-21T08:45:46.4611756Z [432s] Generation 11 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:45:54.6278105Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 1.2 configs/s 2026-02-21T08:45:55.6305270Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 20.0 configs/s 2026-02-21T08:45:56.9024731Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 675.9 2026-02-21T08:45:56.9025313Z configs/s 2026-02-21T08:45:57.2723852Z [443s] Generation 11 complete: 2026-02-21T08:45:57.2724213Z error=3 2026-02-21T08:45:57.2724434Z ok=15 2026-02-21T08:45:57.2724639Z min=0.0755 2026-02-21T08:45:57.2724872Z mid=0.0789 2026-02-21T08:45:57.2725069Z max=0.2450 2026-02-21T08:45:57.2725292Z best={'block_sizes': [256, 16, 16], 2026-02-21T08:45:57.2725662Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:45:57.2726010Z 'l2_groupings': [64], 2026-02-21T08:45:57.2726289Z 'load_eviction_policies': ['', ''], 2026-02-21T08:45:57.2726600Z 'loop_orders': [[0, 1]], 2026-02-21T08:45:57.2726875Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:45:57.2727155Z 'num_sm_multiplier': 2, 2026-02-21T08:45:57.2727408Z 'num_stages': 4, 2026-02-21T08:45:57.2727633Z 'num_warps': 2, 2026-02-21T08:45:57.2727889Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:45:57.2728215Z 'range_flattens': [None, True], 2026-02-21T08:45:57.2728510Z 'range_multi_buffers': [True, True], 2026-02-21T08:45:57.2728816Z 'range_num_stages': [0, 2], 2026-02-21T08:45:57.2729094Z 'range_unroll_factors': [1, 2], 2026-02-21T08:45:57.2729811Z 'range_warp_specializes': [], 2026-02-21T08:45:57.2730088Z 'waves_per_eu': 1} 2026-02-21T08:45:57.2994801Z [443s] Fitting surrogate: 759 points, 759 targets 2026-02-21T08:45:57.3921944Z [443s] Autotuning complete in 443.3s after searching 721 configs. 
2026-02-21T08:45:57.3922480Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:45:57.3924571Z @helion.kernel(config=helion.Config(block_sizes=[256, 16, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:45:57.3926399Z 2026-02-21T08:45:57.3926858Z [443s] Code of selected kernel: /tmp/torchinductor_root/6j/c6j4pcigu52vdmxsdp26z6z6tjsrk6hrno6jgs5d7xmqfx4qjwen.py 2026-02-21T08:45:58.3559473Z WARNING:tritonbench.utils.triton_op:Completed input ID 10: 2026-02-21T08:45:58.3559638Z x_val 2026-02-21T08:45:58.3559721Z ------------------- 2026-02-21T08:45:58.3559814Z (16, 1, 7168, 8192) 2026-02-21T08:45:58.3559865Z 2026-02-21T08:45:58.3588488Z 40%|████ | 4/10 [36:37<53:10, 531.68s/it] WARNING:tritonbench.utils.triton_op:Running input ID 14: 2026-02-21T08:45:58.3588685Z x_val 2026-02-21T08:45:58.3588765Z ------------------- 2026-02-21T08:45:58.3588853Z (64, 1, 7168, 8192) 2026-02-21T08:45:58.3591720Z INFO:tritonbench.utils.triton_op:Took 0.15ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:45:59.3402078Z INFO:tritonbench.utils.triton_op:Took 5.25ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:46:00.2092341Z Autotune Choices Stats: 2026-02-21T08:46:00.2094301Z {"num_choices": 31, "num_triton_choices": 30, "best_kernel": "mm", "best_time": 0.05023900046944618, "best_triton_pos": 1, "best_triton_time": 0.060398999601602554, "best_triton_kernel": "triton_mm_67", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"} 2026-02-21T08:46:00.2101288Z AUTOTUNE mm(64x8192, 8192x7168) 2026-02-21T08:46:00.2101591Z strides: [8192, 1], [7168, 1] 2026-02-21T08:46:00.2101769Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:46:00.2101903Z mm 0.0502 ms 100.0% 2026-02-21T08:46:00.2102298Z triton_mm_67 0.0604 ms 83.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:46:00.2102992Z triton_mm_62 0.0687 ms 73.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:46:00.2103638Z triton_mm_66 0.0688 ms 73.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:46:00.2104288Z triton_mm_74 0.0694 ms 72.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:46:00.2104929Z triton_mm_63 0.0770 ms 65.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:46:00.2105817Z triton_mm_61 0.0772 ms 65.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:46:00.2106455Z triton_mm_83 0.0775 ms 64.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:46:00.2107096Z triton_mm_58 0.0812 ms 61.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:46:00.2107731Z triton_mm_57 0.0878 ms 57.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:46:00.2108232Z SingleProcess AUTOTUNE benchmarking takes 0.5805 seconds and 0.1320 seconds precompiling for 31 choices 2026-02-21T08:46:01.5598348Z INFO:tritonbench.utils.triton_op:Took 0.15ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:46:01.5625750Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:46:01.5626063Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:46:01.5626320Z 'dtype': 'torch.bfloat16', 2026-02-21T08:46:01.5626559Z 'shape': (64, 1, 8192), 2026-02-21T08:46:01.5626781Z 'stride': (8192, 8192, 1)}, 2026-02-21T08:46:01.5627013Z { 'device': 'cuda:0', 2026-02-21T08:46:01.5627226Z 'dtype': 'torch.int32', 2026-02-21T08:46:01.5627442Z 'shape': (8192, 7168), 2026-02-21T08:46:01.5627658Z 'stride': (7168, 1)}), 2026-02-21T08:46:01.5627861Z 'kwargs': {}} 2026-02-21T08:46:01.5657842Z INFO:tritonbench.utils.triton_op:Took 3.41ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:46:01.7487768Z [0s] Autotune random seed: 2134834638 2026-02-21T08:46:01.8436356Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:46:38.9483837Z [37s] Timeout after 30s compiling Config(block_sizes=[128, 1, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', 
''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:46:42.1619387Z [40s] Timeout after 30s compiling Config(block_sizes=[1024, 32, 8], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T08:46:44.4572538Z [42s] Timeout after 30s compiling Config(block_sizes=[512, 4, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T08:46:44.4591877Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T08:46:46.1443460Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:46:46.1448283Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T08:46:46.1451443Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:46:46.1451816Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T08:46:46.1452128Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:46:46.1452385Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:46:46.1452572Z #smem = #ttg.shared_memory 2026-02-21T08:46:46.1452812Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:46:46.1453286Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:46:46.1453689Z %cst = arith.constant dense<64> : tensor<16x1xi64, #mma> 2026-02-21T08:46:46.1453861Z %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma> 2026-02-21T08:46:46.1454028Z %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma> 2026-02-21T08:46:46.1454199Z %cst_2 = arith.constant dense<7168> : tensor<1x256xi64, #mma> 2026-02-21T08:46:46.1454363Z %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T08:46:46.1454531Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:46.1454699Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:46.1454872Z %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T08:46:46.1455021Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:46:46.1455170Z %cst_7 = arith.constant 
dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T08:46:46.1455331Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T08:46:46.1455447Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:46:46.1455587Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:46:46.1455707Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:46:46.1455975Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:46:46.1456085Z %c112_i32 = arith.constant 112 : i32 2026-02-21T08:46:46.1456227Z %cst_8 = arith.constant dense<0> : tensor<1x256xi8, #blocked2> 2026-02-21T08:46:46.1456399Z %cst_9 = arith.constant dense<7168> : tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1456574Z %cst_10 = arith.constant dense<0> : tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1456748Z %cst_11 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1456892Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:46:46.1457009Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:46:46.1457122Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:46:46.1457234Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T08:46:46.1457418Z %cst_12 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1457611Z %0 = tt.get_program_id x : i32 2026-02-21T08:46:46.1457807Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:46.1458080Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:46.1458340Z %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:46.1458584Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:46.1458782Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:46:46.1459018Z %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = 
#blocked2}>> 2026-02-21T08:46:46.1459384Z %7 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:46.1459740Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:46:46.1460151Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:46:46.1460548Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:46.1460801Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:46.1460995Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T08:46:46.1461189Z %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:46.1461377Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T08:46:46.1461583Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T08:46:46.1461848Z %16 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:46.1462146Z %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:46.1462448Z %18 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:46.1462701Z scf.for %arg3 = %0 to %c112_i32 step %c4864_i32 : i32 { 2026-02-21T08:46:46.1462848Z %19 = arith.divsi %arg3, %c112_i32 : i32 2026-02-21T08:46:46.1462971Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T08:46:46.1463087Z %21 = arith.subi %c4_i32, %20 : i32 
2026-02-21T08:46:46.1463201Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T08:46:46.1463318Z %23 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T08:46:46.1463438Z %24 = arith.remsi %23, %22 : i32 2026-02-21T08:46:46.1463550Z %25 = arith.addi %20, %24 : i32 2026-02-21T08:46:46.1463693Z %26 = arith.divsi %23, %22 : i32 2026-02-21T08:46:46.1463806Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T08:46:46.1463973Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:46.1464195Z %29 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:46.1464363Z %30 = arith.muli %26, %c256_i32 : i32 2026-02-21T08:46:46.1464588Z %31 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:46:46.1464837Z %32 = arith.muli %31, %cst_6 : tensor<16x1xi32, #blocked1> 2026-02-21T08:46:46.1465025Z %33 = tt.broadcast %32 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:46.1465200Z %34 = arith.extsi %30 : i32 to i64 2026-02-21T08:46:46.1465375Z %35 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:46.1465619Z %36 = arith.addi %35, %7 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:46.1465900Z %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1466161Z %38 = arith.cmpi sge, %37, %cst_10 : tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1466332Z %39 = arith.cmpi slt, %37, %cst_9 : tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1466497Z %40 = arith.andi %38, %39 : tensor<1x256xi1, #blocked2> 2026-02-21T08:46:46.1466728Z %41 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %cst_7) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T08:46:46.1466946Z %64 = arith.muli %arg4, %c2_i32 : i32 
2026-02-21T08:46:46.1467146Z %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:46.1467361Z %66 = arith.addi %65, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:46.1467638Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:46.1467910Z %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:46.1468101Z %69 = arith.addi %33, %68 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:46.1468299Z %70 = tt.addptr %4, %69 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:46.1468502Z %71 = tt.load %70 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:46.1468769Z %72 = ttg.convert_layout %71 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:46.1469170Z %73 = arith.extf %72 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:46.1469454Z %74 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:46:46.1469575Z %75 = arith.muli %74, %c7168_i64 : i64 2026-02-21T08:46:46.1469715Z %76 = tt.splat %75 : i64 -> tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1469872Z %77 = arith.addi %76, %37 : tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1470062Z %78 = tt.addptr %5, %77 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1470270Z %79 = tt.load %78, %40, %cst_8 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:46:46.1470536Z %80 = ttg.convert_layout %79 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1470819Z %81 = arith.shli %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1471056Z %82 = arith.shrsi %81, %cst_12 : 
tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1471326Z %83 = arith.shrsi %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1471615Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:46:46.1471947Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:46:46.1472229Z %86 = tt.broadcast %84 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1472467Z %87 = arith.select %12, %86, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1472703Z %88 = tt.broadcast %85 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1472931Z %89 = arith.select %14, %88, %87 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1473154Z %90 = tt.reshape %89 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T08:46:46.1473376Z %91 = arith.sitofp %90 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T08:46:46.1473626Z %92 = ttg.local_alloc %91 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T08:46:46.1473948Z %93 = ttg.local_load %92 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:46.1474421Z %94 = tt.dot %73, %93, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T08:46:46.1474771Z %95 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:46:46.1474890Z %96 = arith.muli %95, %c2_i32 : i32 2026-02-21T08:46:46.1499071Z %97 = tt.splat %96 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent 
= #blocked1}>> 2026-02-21T08:46:46.1499383Z %98 = arith.addi %97, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:46.1499656Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:46.1499931Z %100 = tt.broadcast %99 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:46.1500126Z %101 = arith.addi %33, %100 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:46.1500328Z %102 = tt.addptr %4, %101 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:46.1500532Z %103 = tt.load %102 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:46.1500800Z %104 = ttg.convert_layout %103 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:46.1501201Z %105 = arith.extf %104 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:46.1501484Z %106 = arith.extsi %95 : i32 to i64 2026-02-21T08:46:46.1501608Z %107 = arith.muli %106, %c7168_i64 : i64 2026-02-21T08:46:46.1501750Z %108 = tt.splat %107 : i64 -> tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1501911Z %109 = arith.addi %108, %37 : tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1502108Z %110 = tt.addptr %5, %109 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi64, #blocked2> 2026-02-21T08:46:46.1502304Z %111 = arith.cmpi slt, %106, %c4096_i64 : i64 2026-02-21T08:46:46.1502450Z %112 = tt.splat %111 : i1 -> tensor<1x256xi1, #blocked2> 2026-02-21T08:46:46.1502604Z %113 = arith.andi %112, %40 : tensor<1x256xi1, #blocked2> 2026-02-21T08:46:46.1502776Z %114 = tt.load %110, %113, %cst_8 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:46:46.1503034Z %115 = ttg.convert_layout %114 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T08:46:46.1503401Z %116 = arith.shli %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1503639Z %117 = arith.shrsi %116, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1503878Z %118 = arith.shrsi %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:46.1504175Z %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:46:46.1504512Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:46:46.1504802Z %121 = tt.broadcast %119 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1505050Z %122 = arith.select %12, %121, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1505294Z %123 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1505533Z %124 = arith.select %14, %123, %122 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:46:46.1505767Z %125 = tt.reshape %124 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T08:46:46.1506000Z %126 = arith.sitofp %125 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T08:46:46.1506255Z %127 = ttg.local_alloc %126 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T08:46:46.1506578Z %128 = ttg.local_load %127 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:46.1507100Z %129 = tt.dot %105, %128, %94, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 
2026-02-21T08:46:46.1507450Z scf.yield %129 : tensor<16x256xf32, #mma> 2026-02-21T08:46:46.1507604Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:46:46.1507790Z %42 = arith.truncf %41 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T08:46:46.1507955Z %43 = arith.extsi %27 : i32 to i64 2026-02-21T08:46:46.1508116Z %44 = tt.splat %43 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:46.1508318Z %45 = arith.addi %44, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:46.1508582Z %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T08:46:46.1508821Z %47 = arith.muli %46, %cst_1 : tensor<16x1xi64, #mma> 2026-02-21T08:46:46.1508994Z %48 = tt.broadcast %47 : tensor<16x1xi64, #mma> -> tensor<16x256xi64, #mma> 2026-02-21T08:46:46.1509199Z %49 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:46.1509404Z %50 = arith.addi %49, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:46.1509669Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:46:46.1509930Z %52 = tt.broadcast %51 : tensor<1x256xi64, #mma> -> tensor<16x256xi64, #mma> 2026-02-21T08:46:46.1510103Z %53 = arith.addi %48, %52 : tensor<16x256xi64, #mma> 2026-02-21T08:46:46.1510284Z %54 = tt.addptr %15, %53 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi64, #mma> 2026-02-21T08:46:46.1510475Z %55 = arith.cmpi sge, %46, %cst_0 : tensor<16x1xi64, #mma> 2026-02-21T08:46:46.1510635Z %56 = arith.cmpi slt, %46, %cst : tensor<16x1xi64, #mma> 2026-02-21T08:46:46.1510784Z %57 = arith.andi %55, %56 : tensor<16x1xi1, #mma> 2026-02-21T08:46:46.1510994Z %58 = tt.broadcast %57 : tensor<16x1xi1, #mma> -> tensor<16x256xi1, #mma> 2026-02-21T08:46:46.1511173Z %59 = arith.cmpi sge, %51, %cst_3 : 
tensor<1x256xi64, #mma> 2026-02-21T08:46:46.1511332Z %60 = arith.cmpi slt, %51, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:46:46.1511483Z %61 = arith.andi %59, %60 : tensor<1x256xi1, #mma> 2026-02-21T08:46:46.1511645Z %62 = tt.broadcast %61 : tensor<1x256xi1, #mma> -> tensor<16x256xi1, #mma> 2026-02-21T08:46:46.1511817Z %63 = arith.andi %58, %62 : tensor<16x256xi1, #mma> 2026-02-21T08:46:46.1511970Z tt.store %54, %42, %63 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T08:46:46.1512172Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T08:46:46.1512354Z tt.return 2026-02-21T08:46:46.1512432Z } 2026-02-21T08:46:46.1512514Z } 2026-02-21T08:46:46.1512558Z 2026-02-21T08:46:46.1512589Z {-# 2026-02-21T08:46:46.1512673Z external_resources: { 2026-02-21T08:46:46.1512772Z mlir_reproducer: { 2026-02-21T08:46:46.1513777Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:46:46.1514778Z disable_threading: false, 2026-02-21T08:46:46.1514887Z verify_each: true 2026-02-21T08:46:46.1514975Z } 2026-02-21T08:46:46.1515048Z } 2026-02-21T08:46:46.1515152Z #-} 2026-02-21T08:46:46.1515433Z /tmp/torchinductor_root/yy/cyyoukks6lw5y4r3bbdrispjowrlw7qo2wiar4qx74wz4k3k6ofn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:46:46.1516176Z /tmp/torchinductor_root/yy/cyyoukks6lw5y4r3bbdrispjowrlw7qo2wiar4qx74wz4k3k6ofn.py:14:0: note: 
Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:46:46.1516722Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:46:46.1517508Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[2, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T08:46:46.1518223Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:46:46.1518389Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:46:50.4596240Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:46:50.4601181Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}> 2026-02-21T08:46:50.4601859Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T08:46:50.4602527Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T08:46:50.4603271Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:46:50.4604395Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:46:50.4604881Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:46:50.4605204Z #smem = #ttg.shared_memory 2026-02-21T08:46:50.4605621Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:46:50.4606497Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:46:50.4607198Z %cst = arith.constant dense<64> : tensor<32x1xi64, #mma> 2026-02-21T08:46:50.4607499Z %cst_0 = arith.constant dense<0> : tensor<32x1xi64, #mma> 2026-02-21T08:46:50.4607803Z %cst_1 = arith.constant dense<7168> : tensor<32x1xi64, #mma> 2026-02-21T08:46:50.4608103Z %cst_2 = arith.constant dense<7168> : tensor<1x512xi64, #mma> 2026-02-21T08:46:50.4608407Z %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma> 2026-02-21T08:46:50.4608708Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:50.4609008Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:50.4609320Z %cst_6 = arith.constant dense<8192> : tensor<32x1xi32, #blocked1> 
2026-02-21T08:46:50.4609589Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:46:50.4609861Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<32x512xf32, #mma> 2026-02-21T08:46:50.4610143Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T08:46:50.4610353Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:46:50.4610561Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:46:50.4610759Z %c28_i32 = arith.constant 28 : i32 2026-02-21T08:46:50.4611226Z %cst_8 = arith.constant dense<0> : tensor<1x512xi8, #blocked2> 2026-02-21T08:46:50.4611540Z %cst_9 = arith.constant dense<7168> : tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4611862Z %cst_10 = arith.constant dense<0> : tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4612171Z %cst_11 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked> 2026-02-21T08:46:50.4612465Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:46:50.4612675Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:46:50.4612872Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:46:50.4613195Z %cst_12 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:50.4613539Z %0 = tt.get_program_id x : i32 2026-02-21T08:46:50.4613738Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T08:46:50.4613941Z %2 = arith.minsi %1, %c28_i32 : i32 2026-02-21T08:46:50.4614300Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:50.4614793Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:50.4615246Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:50.4615568Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:46:50.4615836Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T08:46:50.4616159Z %8 = tt.make_range {end = 512 : i32, start = 0 : i32} : 
tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:50.4616593Z %9 = arith.extsi %8 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:50.4617081Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:46:50.4617631Z %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:46:50.4618217Z %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:50.4618550Z %13 = arith.cmpi eq, %12, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:50.4618815Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T08:46:50.4619079Z %15 = arith.cmpi eq, %12, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:50.4619324Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T08:46:50.4619601Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<32x512x!tt.ptr, #mma> 2026-02-21T08:46:50.4619958Z %18 = arith.extsi %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:50.4620359Z %19 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:50.4620773Z %20 = arith.extsi %19 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:50.4621090Z scf.for %arg3 = %0 to %2 step %c1_i32 : i32 { 2026-02-21T08:46:50.4621269Z %21 = arith.remsi %arg3, %c2_i32 : i32 2026-02-21T08:46:50.4621436Z %22 = arith.divsi %arg3, %c2_i32 : i32 2026-02-21T08:46:50.4621591Z %23 = 
arith.muli %21, %c32_i32 : i32 2026-02-21T08:46:50.4621810Z %24 = tt.splat %23 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:50.4622110Z %25 = arith.addi %24, %3 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:50.4622331Z %26 = arith.muli %22, %c512_i32 : i32 2026-02-21T08:46:50.4622677Z %27 = tt.expand_dims %25 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:46:50.4623013Z %28 = arith.muli %27, %cst_6 : tensor<32x1xi32, #blocked1> 2026-02-21T08:46:50.4623264Z %29 = tt.broadcast %28 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:46:50.4623494Z %30 = arith.extsi %26 : i32 to i64 2026-02-21T08:46:50.4623709Z %31 = tt.splat %30 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:50.4624003Z %32 = arith.addi %31, %9 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:50.4624368Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4624745Z %34 = arith.cmpi sge, %33, %cst_10 : tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4624936Z %35 = arith.cmpi slt, %33, %cst_9 : tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4625108Z %36 = arith.andi %34, %35 : tensor<1x512xi1, #blocked2> 2026-02-21T08:46:50.4625357Z %37 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c1_i32 iter_args(%arg5 = %cst_7) -> (tensor<32x512xf32, #mma>) : i32 { 2026-02-21T08:46:50.4625592Z %60 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:46:50.4625767Z %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:50.4625999Z %62 = arith.addi %61, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:50.4626283Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> 
tensor<1x2xi32, #blocked1> 2026-02-21T08:46:50.4626571Z %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:46:50.4626773Z %65 = arith.addi %29, %64 : tensor<32x2xi32, #blocked1> 2026-02-21T08:46:50.4626981Z %66 = tt.addptr %6, %65 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:46:50.4627194Z %67 = tt.load %66 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:46:50.4627461Z %68 = ttg.local_alloc %67 : (tensor<32x2xbf16, #blocked1>) -> !ttg.memdesc<32x2xbf16, #shared, #smem> 2026-02-21T08:46:50.4627807Z %69 = ttg.local_load %68 : !ttg.memdesc<32x2xbf16, #shared, #smem> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:50.4628237Z %70 = arith.extf %69 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:50.4628533Z %71 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:46:50.4628665Z %72 = arith.muli %71, %c7168_i64 : i64 2026-02-21T08:46:50.4628812Z %73 = tt.splat %72 : i64 -> tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4628973Z %74 = arith.addi %73, %33 : tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4629180Z %75 = tt.addptr %7, %74 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi64, #blocked2> 2026-02-21T08:46:50.4629402Z %76 = tt.load %75, %36, %cst_8 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T08:46:50.4629680Z %77 = ttg.convert_layout %76 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:50.4629978Z %78 = arith.shli %77, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:50.4630228Z %79 = arith.shrsi %78, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:50.4630480Z %80 = arith.shrsi %77, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:50.4630788Z %81 = tt.expand_dims 
%79 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T08:46:50.4631186Z %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T08:46:50.4631487Z %83 = tt.broadcast %81 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T08:46:50.4631743Z %84 = arith.select %14, %83, %cst_11 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T08:46:50.4631996Z %85 = tt.broadcast %82 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T08:46:50.4632240Z %86 = arith.select %16, %85, %84 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T08:46:50.4632481Z %87 = tt.reshape %86 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked2> 2026-02-21T08:46:50.4632716Z %88 = arith.sitofp %87 : tensor<2x512xi8, #blocked2> to tensor<2x512xf32, #blocked2> 2026-02-21T08:46:50.4632981Z %89 = ttg.local_alloc %88 : (tensor<2x512xf32, #blocked2>) -> !ttg.memdesc<2x512xf32, #shared1, #smem> 2026-02-21T08:46:50.4633333Z %90 = ttg.local_load %89 : !ttg.memdesc<2x512xf32, #shared1, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:50.4633838Z %91 = tt.dot %70, %90, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T08:46:50.4634212Z scf.yield %91 : tensor<32x512xf32, #mma> 2026-02-21T08:46:50.4634390Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T08:46:50.4634605Z %38 = arith.truncf %37 : tensor<32x512xf32, #mma> to tensor<32x512xbf16, #mma> 2026-02-21T08:46:50.4634774Z %39 = arith.extsi %23 : i32 to i64 2026-02-21T08:46:50.4634930Z %40 = tt.splat %39 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T08:46:50.4635133Z %41 = arith.addi %40, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:50.4635391Z %42 = tt.expand_dims %41 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T08:46:50.4635652Z %43 = arith.muli %42, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T08:46:50.4635826Z %44 = tt.broadcast %43 : tensor<32x1xi64, #mma> -> tensor<32x512xi64, #mma> 2026-02-21T08:46:50.4636025Z %45 = tt.splat %30 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:50.4636230Z %46 = arith.addi %45, %20 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:50.4636491Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T08:46:50.4636748Z %48 = tt.broadcast %47 : tensor<1x512xi64, #mma> -> tensor<32x512xi64, #mma> 2026-02-21T08:46:50.4636923Z %49 = arith.addi %44, %48 : tensor<32x512xi64, #mma> 2026-02-21T08:46:50.4637105Z %50 = tt.addptr %17, %49 : tensor<32x512x!tt.ptr, #mma>, tensor<32x512xi64, #mma> 2026-02-21T08:46:50.4637298Z %51 = arith.cmpi sge, %42, %cst_0 : tensor<32x1xi64, #mma> 2026-02-21T08:46:50.4637459Z %52 = arith.cmpi slt, %42, %cst : tensor<32x1xi64, #mma> 2026-02-21T08:46:50.4637607Z %53 = arith.andi %51, %52 : tensor<32x1xi1, #mma> 2026-02-21T08:46:50.4637780Z %54 = tt.broadcast %53 : tensor<32x1xi1, #mma> -> tensor<32x512xi1, #mma> 2026-02-21T08:46:50.4637957Z %55 = arith.cmpi sge, %47, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T08:46:50.4638119Z %56 = arith.cmpi slt, %47, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T08:46:50.4638273Z %57 = arith.andi %55, %56 : tensor<1x512xi1, #mma> 2026-02-21T08:46:50.4638438Z %58 = tt.broadcast %57 : tensor<1x512xi1, #mma> -> tensor<32x512xi1, #mma> 2026-02-21T08:46:50.4638613Z %59 = arith.andi %54, %58 : tensor<32x512xi1, #mma> 2026-02-21T08:46:50.4638763Z tt.store %50, %38, %59 : 
tensor<32x512x!tt.ptr, #mma> 2026-02-21T08:46:50.4639019Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32} 2026-02-21T08:46:50.4639198Z tt.return 2026-02-21T08:46:50.4639282Z } 2026-02-21T08:46:50.4639359Z } 2026-02-21T08:46:50.4639404Z 2026-02-21T08:46:50.4639435Z {-# 2026-02-21T08:46:50.4639516Z external_resources: { 2026-02-21T08:46:50.4639612Z mlir_reproducer: { 2026-02-21T08:46:50.4640611Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:46:50.4641614Z disable_threading: false, 2026-02-21T08:46:50.4641719Z verify_each: true 2026-02-21T08:46:50.4641814Z } 2026-02-21T08:46:50.4641886Z } 2026-02-21T08:46:50.4641956Z #-} 2026-02-21T08:46:50.4642231Z /tmp/torchinductor_root/ov/cov4uajfvx5mko2n2x5v26enht6si7crrlbd46iulyg2niaxw46u.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:46:50.4642959Z /tmp/torchinductor_root/ov/cov4uajfvx5mko2n2x5v26enht6si7crrlbd46iulyg2niaxw46u.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:46:50.4643511Z [48s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:46:50.4644290Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:46:50.4645035Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:46:50.4645202Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:46:55.5810419Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:46:55.5833502Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:46:55.5834181Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:46:55.5834829Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:46:55.5835445Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:46:55.5836005Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:46:55.5836504Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:46:55.5836864Z #smem = #ttg.shared_memory 2026-02-21T08:46:55.5837324Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 
2026-02-21T08:46:55.5838266Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:46:55.5839789Z %cst = arith.constant dense<64> : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5840129Z %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5840462Z %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5840800Z %cst_2 = arith.constant dense<7168> : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5841130Z %cst_3 = arith.constant dense<0> : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5841463Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:55.5841795Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:55.5842148Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5842469Z %c331_i32 = arith.constant 331 : i32 2026-02-21T08:46:55.5842998Z %c-1_i32 = arith.constant -1 : i32 2026-02-21T08:46:55.5843281Z %cst_7 = arith.constant dense<0> : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5843635Z %cst_8 = arith.constant dense : tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5843944Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T08:46:55.5844182Z %c4094_i32 = arith.constant 4094 : i32 2026-02-21T08:46:55.5844499Z %cst_9 = arith.constant dense<8188> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5844901Z %cst_10 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5845247Z %cst_11 = arith.constant dense<0> : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5845515Z %cst_12 = arith.constant dense<7168> : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5845793Z %cst_13 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5846018Z %c2_i32 = 
arith.constant 2 : i32 2026-02-21T08:46:55.5846188Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:46:55.5846371Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:46:55.5846544Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:46:55.5846716Z %c28_i32 = arith.constant 28 : i32 2026-02-21T08:46:55.5846886Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:46:55.5847204Z %c912_i32 = arith.constant 912 : i32 2026-02-21T08:46:55.5847422Z %cst_14 = arith.constant dense<0> : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5847653Z %c608_i32 = arith.constant 608 : i32 2026-02-21T08:46:55.5847821Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:46:55.5847999Z %c0_i64 = arith.constant 0 : i64 2026-02-21T08:46:55.5848165Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T08:46:55.5848381Z %cst_15 = arith.constant dense<0> : tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5848605Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:46:55.5848775Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:46:55.5848941Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:46:55.5849108Z %c304_i32 = arith.constant 304 : i32 2026-02-21T08:46:55.5849327Z %cst_16 = arith.constant dense<4> : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5849654Z %cst_17 = arith.constant dense<4> : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5849956Z %0 = tt.get_program_id x : i32 2026-02-21T08:46:55.5850251Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5850663Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5851059Z %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5851419Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5851723Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<1x1024x!tt.ptr, 
#blocked2> 2026-02-21T08:46:55.5852158Z %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5852665Z %7 = arith.extsi %6 : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5853215Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:46:55.5853841Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:46:55.5854396Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:55.5854690Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:55.5854925Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x1024xi1, #blocked> 2026-02-21T08:46:55.5855185Z %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:46:55.5855403Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x1024xi1, #blocked> 2026-02-21T08:46:55.5855655Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x1024x!tt.ptr, #mma> 2026-02-21T08:46:55.5855966Z %16 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5856319Z %17 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5856683Z %18 = arith.extsi %17 : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5856955Z %19 = arith.subi %c331_i32, %0 : i32 2026-02-21T08:46:55.5857095Z %20 = arith.divui %19, %c304_i32 : i32 
2026-02-21T08:46:55.5857232Z %21 = arith.remsi %20, %c3_i32 : i32 2026-02-21T08:46:55.5857361Z %22 = arith.subi %20, %21 : i32 2026-02-21T08:46:55.5857501Z %23 = arith.muli %22, %c304_i32 : i32 2026-02-21T08:46:55.5857637Z %24 = arith.addi %0, %23 : i32 2026-02-21T08:46:55.5857819Z scf.for %arg3 = %0 to %24 step %c912_i32 : i32 { 2026-02-21T08:46:55.5857975Z %30 = arith.divsi %arg3, %c28_i32 : i32 2026-02-21T08:46:55.5858118Z %31 = arith.muli %30, %c4_i32 : i32 2026-02-21T08:46:55.5858255Z %32 = arith.subi %c4_i32, %31 : i32 2026-02-21T08:46:55.5858384Z %33 = arith.minsi %32, %c4_i32 : i32 2026-02-21T08:46:55.5858525Z %34 = arith.remsi %arg3, %c28_i32 : i32 2026-02-21T08:46:55.5858658Z %35 = arith.remsi %34, %33 : i32 2026-02-21T08:46:55.5858787Z %36 = arith.addi %31, %35 : i32 2026-02-21T08:46:55.5858918Z %37 = arith.divsi %34, %33 : i32 2026-02-21T08:46:55.5859047Z %38 = arith.muli %36, %c16_i32 : i32 2026-02-21T08:46:55.5859243Z %39 = tt.splat %38 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5859503Z %40 = arith.addi %39, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5859703Z %41 = arith.muli %37, %c1024_i32 : i32 2026-02-21T08:46:55.5859960Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5860253Z %43 = arith.muli %42, %cst_13 : tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5860470Z %44 = tt.broadcast %43 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5860669Z %45 = arith.extsi %41 : i32 to i64 2026-02-21T08:46:55.5860866Z %46 = tt.splat %45 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5861122Z %47 = arith.addi %46, %7 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5861446Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = 
#blocked2}>> -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5861787Z %49 = arith.cmpi sge, %48, %cst_11 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5861990Z %50 = arith.cmpi slt, %48, %cst_12 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5862184Z %51 = arith.andi %49, %50 : tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5862413Z %52 = tt.addptr %5, %48 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5862718Z %53 = tt.load %52, %51, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5862973Z %54 = arith.addi %48, %cst_12 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5863200Z %55 = tt.addptr %5, %54 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5863492Z %56 = tt.load %55, %51, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5863969Z %57:3 = scf.for %arg4 = %c0_i32 to %c4094_i32 step %c1_i32 iter_args(%arg5 = %cst_6, %arg6 = %53, %arg7 = %56) -> (tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>) : i32 { 2026-02-21T08:46:55.5864320Z %314 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:46:55.5864493Z %315 = tt.splat %314 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5864719Z %316 = arith.addi %315, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5864997Z %317 = tt.expand_dims %316 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:55.5865272Z %318 = tt.broadcast %317 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5865467Z %319 = arith.addi %44, %318 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5865691Z %320 = tt.addptr %4, %319 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5865903Z %321 = tt.load %320 : 
tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5866168Z %322 = ttg.convert_layout %321 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5866606Z %323 = arith.extf %322 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5866892Z %324 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:46:55.5867013Z %325 = arith.extsi %324 : i32 to i64 2026-02-21T08:46:55.5867136Z %326 = arith.muli %325, %c7168_i64 : i64 2026-02-21T08:46:55.5867276Z %327 = tt.splat %326 : i64 -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5867439Z %328 = arith.addi %327, %48 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5867639Z %329 = tt.addptr %5, %328 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5867853Z %330 = tt.load %329, %51, %cst_14 : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5868037Z %331 = arith.shli %arg6, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5868208Z %332 = arith.shrsi %331, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5868463Z %333 = ttg.convert_layout %332 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5868716Z %334 = arith.shrsi %arg6, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5868966Z %335 = ttg.convert_layout %334 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5869308Z %336 = tt.expand_dims %333 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5869687Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5869980Z %338 = tt.broadcast %336 : tensor<1x1x1024xi8, 
#blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5870236Z %339 = arith.select %12, %338, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5870483Z %340 = tt.broadcast %337 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5870732Z %341 = arith.select %14, %340, %339 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5870974Z %342 = tt.reshape %341 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5871210Z %343 = arith.sitofp %342 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5871471Z %344 = ttg.local_alloc %343 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5871806Z %345 = ttg.local_load %344 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5872293Z %346 = tt.dot %323, %345, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5872737Z scf.yield %346, %arg7, %330 : tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5872980Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T08:46:55.5873178Z %58 = arith.addi %3, %cst_9 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5873452Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:55.5873731Z %60 = tt.broadcast %59 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5873921Z %61 = arith.addi %44, %60 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5874157Z %62 = tt.addptr %4, %61 : 
tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5886029Z %63 = tt.load %62 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5886308Z %64 = ttg.convert_layout %63 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5886715Z %65 = arith.extf %64 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5887025Z %66 = arith.shli %57#1, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5887200Z %67 = arith.shrsi %66, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5887458Z %68 = ttg.convert_layout %67 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5887711Z %69 = arith.shrsi %57#1, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5887960Z %70 = ttg.convert_layout %69 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5888302Z %71 = tt.expand_dims %68 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5888658Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5888947Z %73 = tt.broadcast %71 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5889190Z %74 = arith.select %12, %73, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5889433Z %75 = tt.broadcast %72 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5889745Z %76 = arith.select %14, %75, %74 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5889980Z %77 = tt.reshape %76 : tensor<1x2x1024xi8, #blocked> -> 
tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5890208Z %78 = arith.sitofp %77 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5890459Z %79 = ttg.local_alloc %78 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5890792Z %80 = ttg.local_load %79 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5891267Z %81 = tt.dot %65, %80, %57#0, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5891668Z %82 = arith.addi %3, %cst_10 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5891947Z %83 = tt.expand_dims %82 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:55.5892220Z %84 = tt.broadcast %83 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5892415Z %85 = arith.addi %44, %84 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5892611Z %86 = tt.addptr %4, %85 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5892814Z %87 = tt.load %86 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5893076Z %88 = ttg.convert_layout %87 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5893468Z %89 = arith.extf %88 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5893776Z %90 = arith.shli %57#2, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5893942Z %91 = arith.shrsi %90, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5894233Z %92 = ttg.convert_layout %91 : tensor<1x1024xi8, 
#blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5894484Z %93 = arith.shrsi %57#2, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5894730Z %94 = ttg.convert_layout %93 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5895067Z %95 = tt.expand_dims %92 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5895410Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5895692Z %97 = tt.broadcast %95 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5895937Z %98 = arith.select %12, %97, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5896177Z %99 = tt.broadcast %96 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5896413Z %100 = arith.select %14, %99, %98 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5896651Z %101 = tt.reshape %100 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5896887Z %102 = arith.sitofp %101 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5897149Z %103 = ttg.local_alloc %102 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5897481Z %104 = ttg.local_load %103 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5897992Z %105 = tt.dot %89, %104, %81, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5898387Z %106 = arith.truncf %105 : 
tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma> 2026-02-21T08:46:55.5898563Z %107 = arith.extsi %38 : i32 to i64 2026-02-21T08:46:55.5898733Z %108 = tt.splat %107 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5898943Z %109 = arith.addi %108, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5899210Z %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5899456Z %111 = arith.muli %110, %cst_1 : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5899636Z %112 = tt.broadcast %111 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5899853Z %113 = tt.splat %45 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5900071Z %114 = arith.addi %113, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5900349Z %115 = tt.expand_dims %114 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5900625Z %116 = tt.broadcast %115 : tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5900813Z %117 = arith.addi %112, %116 : tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5901014Z %118 = tt.addptr %15, %117 : tensor<16x1024x!tt.ptr, #mma>, tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5901217Z %119 = arith.cmpi sge, %110, %cst_0 : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5901387Z %120 = arith.cmpi slt, %110, %cst : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5901543Z %121 = arith.andi %119, %120 : tensor<16x1xi1, #mma> 2026-02-21T08:46:55.5901722Z %122 = tt.broadcast %121 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5901945Z %123 = arith.cmpi sge, %115, %cst_3 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5902111Z %124 = arith.cmpi slt, %115, %cst_2 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5902276Z %125 = arith.andi %123, %124 : 
tensor<1x1024xi1, #mma> 2026-02-21T08:46:55.5902451Z %126 = tt.broadcast %125 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5902636Z %127 = arith.andi %122, %126 : tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5902804Z tt.store %118, %106, %127 : tensor<16x1024x!tt.ptr, #mma> 2026-02-21T08:46:55.5902957Z %128 = arith.addi %arg3, %c304_i32 : i32 2026-02-21T08:46:55.5903085Z %129 = arith.divsi %128, %c28_i32 : i32 2026-02-21T08:46:55.5903205Z %130 = arith.muli %129, %c4_i32 : i32 2026-02-21T08:46:55.5903326Z %131 = arith.subi %c4_i32, %130 : i32 2026-02-21T08:46:55.5903444Z %132 = arith.minsi %131, %c4_i32 : i32 2026-02-21T08:46:55.5903567Z %133 = arith.remsi %128, %c28_i32 : i32 2026-02-21T08:46:55.5903686Z %134 = arith.remsi %133, %132 : i32 2026-02-21T08:46:55.5903806Z %135 = arith.addi %130, %134 : i32 2026-02-21T08:46:55.5903922Z %136 = arith.divsi %133, %132 : i32 2026-02-21T08:46:55.5904036Z %137 = arith.muli %135, %c16_i32 : i32 2026-02-21T08:46:55.5904211Z %138 = tt.splat %137 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5904440Z %139 = arith.addi %138, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5904620Z %140 = arith.muli %136, %c1024_i32 : i32 2026-02-21T08:46:55.5904847Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5905107Z %142 = arith.muli %141, %cst_13 : tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5905350Z %143 = tt.broadcast %142 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5905525Z %144 = arith.extsi %140 : i32 to i64 2026-02-21T08:46:55.5905702Z %145 = tt.splat %144 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5905934Z %146 = arith.addi %145, %7 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5906219Z %147 = 
tt.expand_dims %146 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5906487Z %148 = arith.cmpi sge, %147, %cst_11 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5906665Z %149 = arith.cmpi slt, %147, %cst_12 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5906835Z %150 = arith.andi %148, %149 : tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5907039Z %151 = tt.addptr %5, %147 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5907305Z %152 = tt.load %151, %150, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5907535Z %153 = arith.addi %147, %cst_12 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5907741Z %154 = tt.addptr %5, %153 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5908000Z %155 = tt.load %154, %150, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5908412Z %156:3 = scf.for %arg4 = %c0_i32 to %c4094_i32 step %c1_i32 iter_args(%arg5 = %cst_6, %arg6 = %152, %arg7 = %155) -> (tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>) : i32 { 2026-02-21T08:46:55.5908739Z %314 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:46:55.5908916Z %315 = tt.splat %314 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5909143Z %316 = arith.addi %315, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5909421Z %317 = tt.expand_dims %316 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:55.5909776Z %318 = tt.broadcast %317 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5909970Z %319 = arith.addi %143, %318 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5910170Z %320 = tt.addptr %4, %319 : 
tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5910377Z %321 = tt.load %320 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5910644Z %322 = ttg.convert_layout %321 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5911055Z %323 = arith.extf %322 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5911336Z %324 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:46:55.5911464Z %325 = arith.extsi %324 : i32 to i64 2026-02-21T08:46:55.5911587Z %326 = arith.muli %325, %c7168_i64 : i64 2026-02-21T08:46:55.5911730Z %327 = tt.splat %326 : i64 -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5911894Z %328 = arith.addi %327, %147 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5912094Z %329 = tt.addptr %5, %328 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5912312Z %330 = tt.load %329, %150, %cst_14 : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5912495Z %331 = arith.shli %arg6, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5912666Z %332 = arith.shrsi %331, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5912951Z %333 = ttg.convert_layout %332 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5913206Z %334 = arith.shrsi %arg6, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5913458Z %335 = ttg.convert_layout %334 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5913798Z %336 = tt.expand_dims %333 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5914168Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5914460Z %338 = tt.broadcast %336 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5914710Z %339 = arith.select %12, %338, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5914959Z %340 = tt.broadcast %337 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5915202Z %341 = arith.select %14, %340, %339 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5915444Z %342 = tt.reshape %341 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5915677Z %343 = arith.sitofp %342 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5915935Z %344 = ttg.local_alloc %343 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5916265Z %345 = ttg.local_load %344 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5916742Z %346 = tt.dot %323, %345, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5917184Z scf.yield %346, %arg7, %330 : tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5922415Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T08:46:55.5922644Z %157 = arith.addi %143, %60 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5922844Z %158 = tt.addptr %4, %157 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5923049Z %159 = tt.load %158 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5923315Z %160 = ttg.convert_layout %159 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 
0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5923713Z %161 = arith.extf %160 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5924017Z %162 = arith.shli %156#1, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5924190Z %163 = arith.shrsi %162, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5924451Z %164 = ttg.convert_layout %163 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5924706Z %165 = arith.shrsi %156#1, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5924955Z %166 = ttg.convert_layout %165 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5925300Z %167 = tt.expand_dims %164 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5925638Z %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5925930Z %169 = tt.broadcast %167 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5926225Z %170 = arith.select %12, %169, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5926470Z %171 = tt.broadcast %168 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5926712Z %172 = arith.select %14, %171, %170 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5926947Z %173 = tt.reshape %172 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5927181Z %174 = arith.sitofp %173 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5927440Z %175 = ttg.local_alloc %174 : (tensor<2x1024xf32, #blocked3>) -> 
!ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5927768Z %176 = ttg.local_load %175 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5928253Z %177 = tt.dot %161, %176, %156#0, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5928620Z %178 = arith.addi %143, %84 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5928816Z %179 = tt.addptr %4, %178 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5929021Z %180 = tt.load %179 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5929285Z %181 = ttg.convert_layout %180 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5929680Z %182 = arith.extf %181 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5929988Z %183 = arith.shli %156#2, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5930159Z %184 = arith.shrsi %183, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5930411Z %185 = ttg.convert_layout %184 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5930707Z %186 = arith.shrsi %156#2, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5930959Z %187 = ttg.convert_layout %186 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5931300Z %188 = tt.expand_dims %185 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5931640Z %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5931929Z %190 = tt.broadcast %188 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5932177Z %191 = arith.select %12, %190, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5932417Z %192 = tt.broadcast %189 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5932658Z %193 = arith.select %14, %192, %191 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5932893Z %194 = tt.reshape %193 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5933122Z %195 = arith.sitofp %194 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5933380Z %196 = ttg.local_alloc %195 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5933706Z %197 = ttg.local_load %196 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5934209Z %198 = tt.dot %182, %197, %177, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5934600Z %199 = arith.truncf %198 : tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma> 2026-02-21T08:46:55.5934773Z %200 = arith.extsi %137 : i32 to i64 2026-02-21T08:46:55.5934936Z %201 = tt.splat %200 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5935144Z %202 = arith.addi %201, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5935404Z %203 = tt.expand_dims %202 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5935644Z %204 = arith.muli %203, %cst_1 : tensor<16x1xi64, #mma> 
2026-02-21T08:46:55.5935820Z %205 = tt.broadcast %204 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5936031Z %206 = tt.splat %144 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5936245Z %207 = arith.addi %206, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5936517Z %208 = tt.expand_dims %207 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5936782Z %209 = tt.broadcast %208 : tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5936966Z %210 = arith.addi %205, %209 : tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5937161Z %211 = tt.addptr %15, %210 : tensor<16x1024x!tt.ptr, #mma>, tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5937360Z %212 = arith.cmpi sge, %203, %cst_0 : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5937526Z %213 = arith.cmpi slt, %203, %cst : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5937678Z %214 = arith.andi %212, %213 : tensor<16x1xi1, #mma> 2026-02-21T08:46:55.5937851Z %215 = tt.broadcast %214 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5938042Z %216 = arith.cmpi sge, %208, %cst_3 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5938207Z %217 = arith.cmpi slt, %208, %cst_2 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5938404Z %218 = arith.andi %216, %217 : tensor<1x1024xi1, #mma> 2026-02-21T08:46:55.5938578Z %219 = tt.broadcast %218 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5938761Z %220 = arith.andi %215, %219 : tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5938921Z tt.store %211, %199, %220 : tensor<16x1024x!tt.ptr, #mma> 2026-02-21T08:46:55.5939075Z %221 = arith.addi %arg3, %c608_i32 : i32 2026-02-21T08:46:55.5939201Z %222 = arith.divsi %221, %c28_i32 : i32 2026-02-21T08:46:55.5939319Z %223 = arith.muli %222, %c4_i32 : i32 2026-02-21T08:46:55.5939437Z %224 = arith.subi %c4_i32, %223 : i32 
2026-02-21T08:46:55.5939553Z %225 = arith.minsi %224, %c4_i32 : i32 2026-02-21T08:46:55.5939673Z %226 = arith.remsi %221, %c28_i32 : i32 2026-02-21T08:46:55.5939789Z %227 = arith.remsi %226, %225 : i32 2026-02-21T08:46:55.5939904Z %228 = arith.addi %223, %227 : i32 2026-02-21T08:46:55.5940021Z %229 = arith.divsi %226, %225 : i32 2026-02-21T08:46:55.5940133Z %230 = arith.muli %228, %c16_i32 : i32 2026-02-21T08:46:55.5940303Z %231 = tt.splat %230 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5940524Z %232 = arith.addi %231, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5940699Z %233 = arith.muli %229, %c1024_i32 : i32 2026-02-21T08:46:55.5940921Z %234 = tt.expand_dims %232 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5941174Z %235 = arith.muli %234, %cst_13 : tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5941392Z %236 = tt.broadcast %235 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5941598Z %237 = arith.extsi %233 : i32 to i64 2026-02-21T08:46:55.5941771Z %238 = tt.splat %237 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5942003Z %239 = arith.addi %238, %7 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5942286Z %240 = tt.expand_dims %239 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5942548Z %241 = arith.cmpi sge, %240, %cst_11 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5942725Z %242 = arith.cmpi slt, %240, %cst_12 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5942896Z %243 = arith.andi %241, %242 : tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5943098Z %244 = tt.addptr %5, %240 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5943365Z %245 = tt.load %244, 
%243, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5943591Z %246 = arith.addi %240, %cst_12 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5943796Z %247 = tt.addptr %5, %246 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5944057Z %248 = tt.load %247, %243, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5944468Z %249:3 = scf.for %arg4 = %c0_i32 to %c4094_i32 step %c1_i32 iter_args(%arg5 = %cst_6, %arg6 = %245, %arg7 = %248) -> (tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>) : i32 { 2026-02-21T08:46:55.5944794Z %314 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:46:55.5944971Z %315 = tt.splat %314 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5945195Z %316 = arith.addi %315, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5945478Z %317 = tt.expand_dims %316 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:55.5945790Z %318 = tt.broadcast %317 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5945984Z %319 = arith.addi %236, %318 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5946181Z %320 = tt.addptr %4, %319 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5946381Z %321 = tt.load %320 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5946648Z %322 = ttg.convert_layout %321 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5947053Z %323 = arith.extf %322 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5947332Z %324 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:46:55.5947459Z 
%325 = arith.extsi %324 : i32 to i64 2026-02-21T08:46:55.5947579Z %326 = arith.muli %325, %c7168_i64 : i64 2026-02-21T08:46:55.5947725Z %327 = tt.splat %326 : i64 -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5947889Z %328 = arith.addi %327, %240 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5948089Z %329 = tt.addptr %5, %328 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5948306Z %330 = tt.load %329, %243, %cst_14 : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5948486Z %331 = arith.shli %arg6, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5948656Z %332 = arith.shrsi %331, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5948910Z %333 = ttg.convert_layout %332 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5949197Z %334 = arith.shrsi %arg6, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5949450Z %335 = ttg.convert_layout %334 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5949790Z %336 = tt.expand_dims %333 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5950140Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5950433Z %338 = tt.broadcast %336 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5950684Z %339 = arith.select %12, %338, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5950931Z %340 = tt.broadcast %337 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5951174Z %341 = arith.select %14, %340, %339 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5951414Z %342 = tt.reshape %341 : 
tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5951648Z %343 = arith.sitofp %342 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5951906Z %344 = ttg.local_alloc %343 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5952241Z %345 = ttg.local_load %344 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5952725Z %346 = tt.dot %323, %345, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5953180Z scf.yield %346, %arg7, %330 : tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5953423Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T08:46:55.5953616Z %250 = arith.addi %236, %60 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5953818Z %251 = tt.addptr %4, %250 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5954023Z %252 = tt.load %251 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5954285Z %253 = ttg.convert_layout %252 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5954691Z %254 = arith.extf %253 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5954995Z %255 = arith.shli %249#1, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5955166Z %256 = arith.shrsi %255, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5955419Z %257 = ttg.convert_layout %256 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T08:46:55.5955671Z %258 = arith.shrsi %249#1, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5955920Z %259 = ttg.convert_layout %258 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5956257Z %260 = tt.expand_dims %257 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5956598Z %261 = tt.expand_dims %259 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5956887Z %262 = tt.broadcast %260 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5957131Z %263 = arith.select %12, %262, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5957410Z %264 = tt.broadcast %261 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5957648Z %265 = arith.select %14, %264, %263 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5957884Z %266 = tt.reshape %265 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5958115Z %267 = arith.sitofp %266 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5958369Z %268 = ttg.local_alloc %267 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5958696Z %269 = ttg.local_load %268 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5959175Z %270 = tt.dot %254, %269, %249#0, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5959537Z %271 = arith.addi %236, %84 : tensor<16x2xi32, #blocked1> 
2026-02-21T08:46:55.5959738Z %272 = tt.addptr %4, %271 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5959937Z %273 = tt.load %272 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5960199Z %274 = ttg.convert_layout %273 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5960593Z %275 = arith.extf %274 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5960893Z %276 = arith.shli %249#2, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5961062Z %277 = arith.shrsi %276, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5961311Z %278 = ttg.convert_layout %277 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5961562Z %279 = arith.shrsi %249#2, %cst_16 : tensor<1x1024xi8, #blocked2> 2026-02-21T08:46:55.5961837Z %280 = ttg.convert_layout %279 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5962172Z %281 = tt.expand_dims %278 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5962513Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5962834Z %283 = tt.broadcast %281 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5963081Z %284 = arith.select %12, %283, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5963329Z %285 = tt.broadcast %282 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5963565Z %286 = arith.select %14, %285, %284 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 
2026-02-21T08:46:55.5963803Z %287 = tt.reshape %286 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5964030Z %288 = arith.sitofp %287 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5964287Z %289 = ttg.local_alloc %288 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5964615Z %290 = ttg.local_load %289 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5965082Z %291 = tt.dot %275, %290, %270, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5965506Z %292 = arith.truncf %291 : tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma> 2026-02-21T08:46:55.5965684Z %293 = arith.extsi %230 : i32 to i64 2026-02-21T08:46:55.5965844Z %294 = tt.splat %293 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5966051Z %295 = arith.addi %294, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5966310Z %296 = tt.expand_dims %295 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5966546Z %297 = arith.muli %296, %cst_1 : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5966722Z %298 = tt.broadcast %297 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5966932Z %299 = tt.splat %237 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5967144Z %300 = arith.addi %299, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5967413Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5967680Z %302 = tt.broadcast %301 : 
tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5967864Z %303 = arith.addi %298, %302 : tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5968055Z %304 = tt.addptr %15, %303 : tensor<16x1024x!tt.ptr, #mma>, tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5968257Z %305 = arith.cmpi sge, %296, %cst_0 : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5968418Z %306 = arith.cmpi slt, %296, %cst : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5968573Z %307 = arith.andi %305, %306 : tensor<16x1xi1, #mma> 2026-02-21T08:46:55.5968765Z %308 = tt.broadcast %307 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5968949Z %309 = arith.cmpi sge, %301, %cst_3 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5969120Z %310 = arith.cmpi slt, %301, %cst_2 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5969276Z %311 = arith.andi %309, %310 : tensor<1x1024xi1, #mma> 2026-02-21T08:46:55.5969497Z %312 = tt.broadcast %311 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5969678Z %313 = arith.andi %308, %312 : tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5969837Z tt.store %304, %292, %313 : tensor<16x1024x!tt.ptr, #mma> 2026-02-21T08:46:55.5969990Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:46:55.5970112Z %25 = arith.subi %c28_i32, %24 : i32 2026-02-21T08:46:55.5970234Z %26 = arith.ceildivsi %25, %c304_i32 : i32 2026-02-21T08:46:55.5970355Z %27 = arith.muli %26, %c4096_i32 : i32 2026-02-21T08:46:55.5970468Z %28 = arith.subi %24, %c304_i32 : i32 2026-02-21T08:46:55.5970970Z %29:9 = scf.for %arg3 = %c0_i32 to %27 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %28, %arg6 = %c0_i32, %arg7 = %cst_6, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_7, %arg11 = %cst_11, %arg12 = %cst_8) -> (i32, i32, i32, tensor<16x1024xf32, #mma>, i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>) : i32 { 2026-02-21T08:46:55.5971473Z %30 = arith.addi %arg4, %c1_i32 : i32 
2026-02-21T08:46:55.5971599Z %31 = arith.cmpi eq, %arg4, %c4095_i32 : i32 2026-02-21T08:46:55.5971726Z %32 = arith.select %31, %c0_i32, %30 : i32 2026-02-21T08:46:55.5971852Z %33 = arith.cmpi eq, %32, %c0_i32 : i32 2026-02-21T08:46:55.5971974Z %34 = arith.select %33, %c0_i32, %arg6 : i32 2026-02-21T08:46:55.5972204Z %35:6 = scf.if %33 -> (i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>, i32) { 2026-02-21T08:46:55.5972438Z %75 = arith.addi %arg5, %c304_i32 : i32 2026-02-21T08:46:55.5972560Z %76 = arith.divsi %75, %c28_i32 : i32 2026-02-21T08:46:55.5972676Z %77 = arith.muli %76, %c4_i32 : i32 2026-02-21T08:46:55.5972828Z %78 = arith.subi %c4_i32, %77 : i32 2026-02-21T08:46:55.5972942Z %79 = arith.minsi %78, %c4_i32 : i32 2026-02-21T08:46:55.5973055Z %80 = arith.remsi %75, %c28_i32 : i32 2026-02-21T08:46:55.5973171Z %81 = arith.remsi %80, %79 : i32 2026-02-21T08:46:55.5973284Z %82 = arith.addi %77, %81 : i32 2026-02-21T08:46:55.5973392Z %83 = arith.divsi %80, %79 : i32 2026-02-21T08:46:55.5973504Z %84 = arith.muli %82, %c16_i32 : i32 2026-02-21T08:46:55.5973672Z %85 = tt.splat %84 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5973890Z %86 = arith.addi %85, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:46:55.5974060Z %87 = arith.muli %83, %c1024_i32 : i32 2026-02-21T08:46:55.5974283Z %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5974540Z %89 = arith.muli %88, %cst_13 : tensor<16x1xi32, #blocked1> 2026-02-21T08:46:55.5974730Z %90 = tt.broadcast %89 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5974906Z %91 = arith.extsi %87 : i32 to i64 2026-02-21T08:46:55.5975073Z %92 = tt.splat %91 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5975297Z %93 = arith.addi %92, %7 : 
tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:46:55.5975583Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5975844Z %95 = arith.cmpi sge, %94, %cst_11 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5976021Z %96 = arith.cmpi slt, %94, %cst_12 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5976185Z %97 = arith.andi %95, %96 : tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5976446Z scf.yield %84, %87, %90, %94, %97, %75 : i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>, i32 2026-02-21T08:46:55.5976682Z } else { 2026-02-21T08:46:55.5976954Z scf.yield %arg8, %arg9, %arg10, %arg11, %arg12, %arg5 : i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>, i32 2026-02-21T08:46:55.5977215Z } 2026-02-21T08:46:55.5977298Z %36 = arith.muli %34, %c2_i32 : i32 2026-02-21T08:46:55.5977466Z %37 = tt.splat %36 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5977679Z %38 = arith.addi %37, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:46:55.5977948Z %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:46:55.5978217Z %40 = tt.broadcast %39 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5978405Z %41 = arith.addi %35#2, %40 : tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5978601Z %42 = tt.addptr %4, %41 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T08:46:55.5978799Z %43 = tt.load %42 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T08:46:55.5979061Z %44 = ttg.convert_layout %43 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5979455Z %45 = arith.extf %44 : 
tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5979729Z %46 = arith.extsi %34 : i32 to i64 2026-02-21T08:46:55.5979849Z %47 = arith.muli %46, %c7168_i64 : i64 2026-02-21T08:46:55.5979982Z %48 = tt.splat %47 : i64 -> tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5980144Z %49 = arith.addi %48, %35#3 : tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5980377Z %50 = tt.addptr %5, %49 : tensor<1x1024x!tt.ptr, #blocked2>, tensor<1x1024xi64, #blocked2> 2026-02-21T08:46:55.5980561Z %51 = arith.cmpi sge, %46, %c0_i64 : i64 2026-02-21T08:46:55.5980691Z %52 = arith.cmpi slt, %46, %c4096_i64 : i64 2026-02-21T08:46:55.5980812Z %53 = arith.andi %51, %52 : i1 2026-02-21T08:46:55.5980945Z %54 = tt.splat %53 : i1 -> tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5981099Z %55 = arith.andi %54, %35#4 : tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5981265Z %56 = tt.load %50, %55, %cst_14 : tensor<1x1024x!tt.ptr, #blocked2> 2026-02-21T08:46:55.5981527Z %57 = ttg.convert_layout %56 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5981808Z %58 = arith.shli %57, %cst_17 : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5982046Z %59 = arith.shrsi %58, %cst_17 : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5982283Z %60 = arith.shrsi %57, %cst_17 : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:46:55.5982577Z %61 = tt.expand_dims %59 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5982917Z %62 = tt.expand_dims %60 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked> 2026-02-21T08:46:55.5983200Z %63 = tt.broadcast %61 : tensor<1x1x1024xi8, #blocked> -> 
tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5983443Z %64 = arith.select %12, %63, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5983681Z %65 = tt.broadcast %62 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5983917Z %66 = arith.select %14, %65, %64 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked> 2026-02-21T08:46:55.5984150Z %67 = tt.reshape %66 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3> 2026-02-21T08:46:55.5984373Z %68 = arith.sitofp %67 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3> 2026-02-21T08:46:55.5984663Z %69 = ttg.local_alloc %68 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem> 2026-02-21T08:46:55.5984988Z %70 = ttg.local_load %69 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:46:55.5985465Z %71 = tt.dot %45, %70, %arg7, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5985814Z %72 = arith.addi %34, %c1_i32 : i32 2026-02-21T08:46:55.5985935Z %73 = arith.cmpi eq, %32, %c4095_i32 : i32 2026-02-21T08:46:55.5986083Z %74 = arith.select %73, %cst_6, %71 : tensor<16x1024xf32, #mma> 2026-02-21T08:46:55.5986223Z scf.if %73 { 2026-02-21T08:46:55.5986363Z %75 = arith.truncf %71 : tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma> 2026-02-21T08:46:55.5986538Z %76 = arith.extsi %35#0 : i32 to i64 2026-02-21T08:46:55.5986655Z %77 = arith.extsi %35#1 : i32 to i64 2026-02-21T08:46:55.5986817Z %78 = tt.splat %76 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5987022Z %79 = arith.addi %78, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:46:55.5987283Z %80 = tt.expand_dims %79 {axis = 1 : i32} : 
tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5987518Z %81 = arith.muli %80, %cst_1 : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5987693Z %82 = tt.broadcast %81 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5987904Z %83 = tt.splat %77 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5988148Z %84 = arith.addi %83, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:46:55.5988428Z %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5988692Z %86 = tt.broadcast %85 : tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5988871Z %87 = arith.addi %82, %86 : tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5989059Z %88 = tt.addptr %15, %87 : tensor<16x1024x!tt.ptr, #mma>, tensor<16x1024xi64, #mma> 2026-02-21T08:46:55.5989256Z %89 = arith.cmpi sge, %80, %cst_0 : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5989416Z %90 = arith.cmpi slt, %80, %cst : tensor<16x1xi64, #mma> 2026-02-21T08:46:55.5989566Z %91 = arith.andi %89, %90 : tensor<16x1xi1, #mma> 2026-02-21T08:46:55.5989732Z %92 = tt.broadcast %91 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5989918Z %93 = arith.cmpi sge, %85, %cst_3 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5990080Z %94 = arith.cmpi slt, %85, %cst_2 : tensor<1x1024xi64, #mma> 2026-02-21T08:46:55.5990240Z %95 = arith.andi %93, %94 : tensor<1x1024xi1, #mma> 2026-02-21T08:46:55.5990409Z %96 = tt.broadcast %95 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5990586Z %97 = arith.andi %92, %96 : tensor<16x1024xi1, #mma> 2026-02-21T08:46:55.5990741Z tt.store %88, %75, %97 : tensor<16x1024x!tt.ptr, #mma> 2026-02-21T08:46:55.5990873Z } 2026-02-21T08:46:55.5991150Z scf.yield %32, %35#5, %72, %74, %35#0, %35#1, %35#2, %35#3, %35#4 : i32, i32, i32, 
tensor<16x1024xf32, #mma>, i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2> 2026-02-21T08:46:55.5991452Z } {tt.num_stages = 3 : i32} 2026-02-21T08:46:55.5991556Z tt.return 2026-02-21T08:46:55.5991637Z } 2026-02-21T08:46:55.5991717Z } 2026-02-21T08:46:55.5991760Z 2026-02-21T08:46:55.5991797Z {-# 2026-02-21T08:46:55.5991875Z external_resources: { 2026-02-21T08:46:55.5991976Z mlir_reproducer: { 2026-02-21T08:46:55.5993009Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:46:55.5994014Z disable_threading: false, 2026-02-21T08:46:55.5994121Z verify_each: true 2026-02-21T08:46:55.5994209Z } 2026-02-21T08:46:55.5994283Z } 2026-02-21T08:46:55.5994352Z #-} 2026-02-21T08:46:55.5994633Z /tmp/torchinductor_root/jv/cjvgrlgzvuwhzkuafdg5p4pazwxtkgt72fjgnqr47zio66tszde6.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:46:55.5995346Z /tmp/torchinductor_root/jv/cjvgrlgzvuwhzkuafdg5p4pazwxtkgt72fjgnqr47zio66tszde6.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:46:55.5995892Z [53s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:46:55.5996728Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T08:46:55.5997446Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:46:55.5997610Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:46:55.6803766Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.9 configs/s 2026-02-21T08:46:55.6815673Z [53s] Adaptive compile timeout: 30s (90% percentile=14.0s, bounds=[30.0s, 30s]) 2026-02-21T08:46:55.6818410Z [53s] Initial random population of 100, 5 starting points: 2026-02-21T08:46:55.6818600Z error=3 2026-02-21T08:46:55.6818693Z timeout=3 2026-02-21T08:46:55.6818781Z ok=94 2026-02-21T08:46:55.6818865Z min=0.2657 2026-02-21T08:46:55.6818946Z mid=2.9382 2026-02-21T08:46:55.6819029Z max=85.7581 2026-02-21T08:46:55.6819128Z best={'block_sizes': [16, 16, 16], 2026-02-21T08:46:55.6819271Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:46:55.6819413Z 'l2_groupings': [1], 2026-02-21T08:46:55.6819542Z 'load_eviction_policies': ['', ''], 2026-02-21T08:46:55.6819665Z 'loop_orders': [[0, 1]], 2026-02-21T08:46:55.6819790Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:46:55.6819897Z 'num_stages': 1, 2026-02-21T08:46:55.6819992Z 'num_warps': 4, 2026-02-21T08:46:55.6820081Z 'pid_type': 'flat', 2026-02-21T08:46:55.6820192Z 'range_flattens': [None, None], 2026-02-21T08:46:55.6820314Z 'range_multi_buffers': [None, None], 2026-02-21T08:46:55.6820439Z 'range_num_stages': [0, 0], 2026-02-21T08:46:55.6820557Z 'range_unroll_factors': [0, 0], 
2026-02-21T08:46:55.6820679Z 'range_warp_specializes': [], 2026-02-21T08:46:55.6820794Z 'waves_per_eu': 1} 2026-02-21T08:46:55.6834099Z [53s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:46:56.6474352Z [54s] Generation 1 starting: 98 neighbors, 5 active search path(s) 2026-02-21T08:47:17.0937567Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 3.4 configs/s 2026-02-21T08:47:24.1370816Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 14.8 configs/s 2026-02-21T08:47:26.5906090Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 340.5 2026-02-21T08:47:26.5907452Z configs/s 2026-02-21T08:47:27.2087894Z [85s] Generation 1 complete: 2026-02-21T08:47:27.2088332Z error=3 2026-02-21T08:47:27.2088535Z ok=101 2026-02-21T08:47:27.2088744Z min=0.1526 2026-02-21T08:47:27.2088956Z mid=0.6797 2026-02-21T08:47:27.2089159Z max=20.1061 2026-02-21T08:47:27.2089398Z best={'block_sizes': [32, 16, 16], 2026-02-21T08:47:27.2089766Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:47:27.2090127Z 'l2_groupings': [1], 2026-02-21T08:47:27.2090411Z 'load_eviction_policies': ['', ''], 2026-02-21T08:47:27.2090724Z 'loop_orders': [[0, 1]], 2026-02-21T08:47:27.2090982Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:47:27.2091225Z 'num_stages': 1, 2026-02-21T08:47:27.2091431Z 'num_warps': 2, 2026-02-21T08:47:27.2091643Z 'pid_type': 'flat', 2026-02-21T08:47:27.2091938Z 'range_flattens': [None, False], 2026-02-21T08:47:27.2092223Z 'range_multi_buffers': [None, None], 2026-02-21T08:47:27.2092529Z 'range_num_stages': [0, 0], 2026-02-21T08:47:27.2092786Z 'range_unroll_factors': [0, 0], 2026-02-21T08:47:27.2093071Z 'range_warp_specializes': [], 2026-02-21T08:47:27.2093322Z 'waves_per_eu': 1} 2026-02-21T08:47:27.2426248Z [85s] Fitting surrogate: 204 points, 204 targets 2026-02-21T08:47:28.3060212Z [86s] Generation 2 starting: 106 neighbors, 5 active search path(s) 2026-02-21T08:47:50.4586098Z Generation 2: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━ 109/109 1.2 configs/s 2026-02-21T08:47:56.2577615Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:47:56.2580686Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:47:56.2581143Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:47:56.2581667Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:47:56.2582148Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:47:56.2582557Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:47:56.2582912Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:47:56.2583174Z #smem = #ttg.shared_memory 2026-02-21T08:47:56.2583510Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:47:56.2584180Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:47:56.2584733Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T08:47:56.2584984Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.2585219Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.2585470Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T08:47:56.2585727Z %cst_3 = 
arith.constant dense<7168> : tensor<2x1xi32, #blocked1> 2026-02-21T08:47:56.2585977Z %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2> 2026-02-21T08:47:56.2586184Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:47:56.2586344Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:47:56.2586511Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:47:56.2586668Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:47:56.2586831Z %c448_i32 = arith.constant 448 : i32 2026-02-21T08:47:56.2587025Z %cst_5 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2587403Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T08:47:56.2587562Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:47:56.2587706Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:47:56.2587838Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T08:47:56.2588061Z %cst_6 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.2588282Z %0 = tt.get_program_id x : i32 2026-02-21T08:47:56.2588511Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.2588842Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:47:56.2589156Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.2589459Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:56.2589766Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:56.2590073Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.2590347Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:47:56.2590578Z %8 = tt.splat %arg1 : 
!tt.ptr -> tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T08:47:56.2590886Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:47:56.2591407Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:47:56.2591872Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.2592168Z %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.2592392Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T08:47:56.2592619Z %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.2592838Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T08:47:56.2593082Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T08:47:56.2593292Z scf.for %arg3 = %0 to %c448_i32 step %c38912_i32 : i32 { 2026-02-21T08:47:56.2593463Z %17 = arith.divsi %arg3, %c7168_i32 : i32 2026-02-21T08:47:56.2593608Z %18 = arith.muli %17, %c64_i32 : i32 2026-02-21T08:47:56.2593750Z %19 = arith.subi %c4_i32, %18 : i32 2026-02-21T08:47:56.2593881Z %20 = arith.minsi %19, %c64_i32 : i32 2026-02-21T08:47:56.2594027Z %21 = arith.remsi %arg3, %c7168_i32 : i32 2026-02-21T08:47:56.2594166Z %22 = arith.remsi %21, %20 : i32 2026-02-21T08:47:56.2594294Z %23 = arith.addi %18, %22 : i32 2026-02-21T08:47:56.2594424Z %24 = arith.divsi %21, %20 : i32 2026-02-21T08:47:56.2594548Z %25 = arith.muli %23, %c16_i32 : i32 2026-02-21T08:47:56.2594744Z %26 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.2594990Z %27 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T08:47:56.2595237Z %28 = arith.addi %26, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.2595473Z %29 = arith.addi %27, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:47:56.2595661Z %30 = arith.muli %24, %c64_i32 : i32 2026-02-21T08:47:56.2595858Z %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.2596100Z %32 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:56.2596380Z %33 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.2596623Z %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:56.2596934Z %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T08:47:56.2597228Z %36 = arith.muli %35, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T08:47:56.2597444Z %37 = tt.broadcast %36 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:47:56.2597784Z %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T08:47:56.2598088Z %39 = tt.broadcast %38 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T08:47:56.2598352Z %40 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T08:47:56.2598625Z %49 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:56.2598845Z %50 = arith.addi %49, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:56.2599019Z %51 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:47:56.2599188Z %52 = tt.splat %51 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.2599402Z %53 = arith.addi %52, %6 : 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.2599677Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:47:56.2599995Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:47:56.2600190Z %56 = arith.addi %37, %55 : tensor<16x4xi32, #blocked2> 2026-02-21T08:47:56.2600394Z %57 = tt.addptr %7, %56 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:47:56.2600597Z %58 = tt.load %57 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:47:56.2600868Z %59 = ttg.convert_layout %58 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.2601268Z %60 = arith.extf %59 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.2601658Z %61 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:47:56.2601903Z %62 = arith.muli %61, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T08:47:56.2602094Z %63 = tt.broadcast %62 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T08:47:56.2602284Z %64 = arith.addi %63, %39 : tensor<2x64xi32, #blocked1> 2026-02-21T08:47:56.2602476Z %65 = tt.addptr %8, %64 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T08:47:56.2602729Z %66 = tt.load %65 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T08:47:56.2602969Z %67 = ttg.convert_layout %66 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.2603249Z %68 = arith.shli %67, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.2603482Z %69 = arith.shrsi %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T08:47:56.2603714Z %70 = arith.shrsi %67, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.2604002Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.2604381Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.2604654Z %73 = tt.broadcast %71 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2604888Z %74 = arith.select %13, %73, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2605116Z %75 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2605341Z %76 = arith.select %15, %75, %74 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2605564Z %77 = tt.reshape %76 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T08:47:56.2605777Z %78 = arith.sitofp %77 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T08:47:56.2606021Z %79 = ttg.local_alloc %78 : (tensor<4x64xf32, #blocked3>) -> !ttg.memdesc<4x64xf32, #shared, #smem> 2026-02-21T08:47:56.2606338Z %80 = ttg.local_load %79 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.2606808Z %81 = tt.dot %60, %80, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T08:47:56.2607155Z %82 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:47:56.2607338Z %83 = tt.splat %82 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:56.2607558Z %84 = arith.addi %83, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T08:47:56.2607727Z %85 = arith.muli %82, %c2_i32 : i32 2026-02-21T08:47:56.2607929Z %86 = tt.splat %85 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.2608141Z %87 = arith.addi %86, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.2608411Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:47:56.2608683Z %89 = tt.broadcast %88 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:47:56.2608875Z %90 = arith.addi %37, %89 : tensor<16x4xi32, #blocked2> 2026-02-21T08:47:56.2609073Z %91 = tt.addptr %7, %90 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:47:56.2609276Z %92 = tt.load %91 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:47:56.2609535Z %93 = ttg.convert_layout %92 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.2609933Z %94 = arith.extf %93 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.2610311Z %95 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:47:56.2610551Z %96 = arith.muli %95, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T08:47:56.2610741Z %97 = tt.broadcast %96 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T08:47:56.2610925Z %98 = arith.addi %97, %39 : tensor<2x64xi32, #blocked1> 2026-02-21T08:47:56.2611121Z %99 = tt.addptr %8, %98 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T08:47:56.2611317Z %100 = tt.load %99 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T08:47:56.2611556Z %101 = ttg.convert_layout %100 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T08:47:56.2611844Z %102 = arith.shli %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.2621016Z %103 = arith.shrsi %102, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.2621253Z %104 = arith.shrsi %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.2621544Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.2621879Z %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.2622164Z %107 = tt.broadcast %105 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2622405Z %108 = arith.select %13, %107, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2622645Z %109 = tt.broadcast %106 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2622880Z %110 = arith.select %15, %109, %108 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.2623112Z %111 = tt.reshape %110 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T08:47:56.2623334Z %112 = arith.sitofp %111 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T08:47:56.2623583Z %113 = ttg.local_alloc %112 : (tensor<4x64xf32, #blocked3>) -> !ttg.memdesc<4x64xf32, #shared, #smem> 2026-02-21T08:47:56.2623910Z %114 = ttg.local_load %113 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.2624380Z %115 = tt.dot %94, %114, %81, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T08:47:56.2624764Z scf.yield %115 : 
tensor<16x64xf32, #mma> 2026-02-21T08:47:56.2624897Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:47:56.2625061Z %41 = arith.truncf %40 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T08:47:56.2625319Z %42 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T08:47:56.2625562Z %43 = arith.muli %42, %cst : tensor<16x1xi32, #mma> 2026-02-21T08:47:56.2625782Z %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T08:47:56.2626034Z %45 = tt.broadcast %43 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T08:47:56.2626229Z %46 = tt.broadcast %44 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T08:47:56.2626399Z %47 = arith.addi %45, %46 : tensor<16x64xi32, #mma> 2026-02-21T08:47:56.2626582Z %48 = tt.addptr %16, %47 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T08:47:56.2626769Z tt.store %48, %41 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T08:47:56.2626931Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T08:47:56.2627066Z tt.return 2026-02-21T08:47:56.2627148Z } 2026-02-21T08:47:56.2627226Z } 2026-02-21T08:47:56.2627272Z 2026-02-21T08:47:56.2627303Z {-# 2026-02-21T08:47:56.2627387Z external_resources: { 2026-02-21T08:47:56.2627486Z mlir_reproducer: { 2026-02-21T08:47:56.2628493Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, 
enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:47:56.2629532Z disable_threading: false, 2026-02-21T08:47:56.2629639Z verify_each: true 2026-02-21T08:47:56.2629732Z } 2026-02-21T08:47:56.2629806Z } 2026-02-21T08:47:56.2629876Z #-} 2026-02-21T08:47:56.2630157Z /tmp/torchinductor_root/dz/cdzhwkiurocicad57kyzlaveorf27dlzeuloa2mas23wvubxwhkl.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:47:56.2630845Z /tmp/torchinductor_root/dz/cdzhwkiurocicad57kyzlaveorf27dlzeuloa2mas23wvubxwhkl.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:47:56.2631405Z [114s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:47:56.2632194Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:47:56.2632906Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:47:56.2633079Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:47:56.5123000Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:47:56.5125871Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T08:47:56.5127102Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:47:56.5128032Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:47:56.5128496Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:47:56.5128954Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:47:56.5129302Z #smem = #ttg.shared_memory 2026-02-21T08:47:56.5129774Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:47:56.5130803Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:47:56.5131605Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T08:47:56.5131831Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.5132057Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.5132288Z %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T08:47:56.5132484Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:47:56.5132682Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T08:47:56.5132909Z %cst_4 = arith.constant dense<7168> : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5133140Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5133362Z %cst_6 = arith.constant dense<4096> : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5133663Z %cst_7 = 
arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:56.5133884Z %cst_8 = arith.constant dense<7168> : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:56.5134107Z %cst_9 = arith.constant dense<0> : tensor<2x64xi8, #blocked2> 2026-02-21T08:47:56.5134473Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:47:56.5134642Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:47:56.5134790Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:47:56.5134967Z %cst_10 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5135160Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T08:47:56.5135312Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:47:56.5135458Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:47:56.5135691Z %cst_11 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5135936Z %0 = tt.get_program_id x : i32 2026-02-21T08:47:56.5136083Z %1 = arith.divsi %0, %c7168_i32 : i32 2026-02-21T08:47:56.5136228Z %2 = arith.muli %1, %c64_i32 : i32 2026-02-21T08:47:56.5136376Z %3 = arith.subi %c4_i32, %2 : i32 2026-02-21T08:47:56.5136523Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T08:47:56.5136663Z %5 = arith.remsi %0, %c7168_i32 : i32 2026-02-21T08:47:56.5136809Z %6 = arith.remsi %5, %4 : i32 2026-02-21T08:47:56.5136950Z %7 = arith.addi %2, %6 : i32 2026-02-21T08:47:56.5137091Z %8 = arith.divsi %5, %4 : i32 2026-02-21T08:47:56.5137221Z %9 = arith.muli %7, %c16_i32 : i32 2026-02-21T08:47:56.5137493Z %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:56.5137843Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:47:56.5138100Z %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:56.5138319Z %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T08:47:56.5138574Z %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:56.5138790Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:47:56.5138959Z %16 = arith.muli %8, %c64_i32 : i32 2026-02-21T08:47:56.5139154Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:56.5139429Z %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.5139671Z %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:56.5139876Z %20 = arith.addi %19, %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:56.5140117Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.5140430Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:47:56.5140689Z %23 = arith.muli %22, %cst_2 : tensor<16x1xi32, #blocked1> 2026-02-21T08:47:56.5140882Z %24 = tt.broadcast %23 : tensor<16x1xi32, #blocked1> -> tensor<16x4xi32, #blocked1> 2026-02-21T08:47:56.5141120Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T08:47:56.5141286Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T08:47:56.5141440Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T08:47:56.5141679Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.5142014Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.5153318Z %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T08:47:56.5153650Z %31 = arith.extsi %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.5153956Z %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:56.5154300Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T08:47:56.5154586Z %34 = tt.broadcast %33 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T08:47:56.5154790Z %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:56.5154958Z %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:56.5155125Z %37 = arith.andi %35, %36 : tensor<1x64xi1, #blocked2> 2026-02-21T08:47:56.5155304Z %38 = tt.broadcast %37 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T08:47:56.5155592Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:47:56.5156006Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:47:56.5156407Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.5156663Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.5156855Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T08:47:56.5157049Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:56.5157239Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T08:47:56.5157501Z %46 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 
iter_args(%arg4 = %cst_3) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T08:47:56.5157781Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:47:56.5157950Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.5158173Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.5158443Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:47:56.5158713Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1> 2026-02-21T08:47:56.5158906Z %61 = arith.addi %24, %60 : tensor<16x4xi32, #blocked1> 2026-02-21T08:47:56.5159101Z %62 = tt.addptr %25, %61 : tensor<16x4x!tt.ptr, #blocked1>, tensor<16x4xi32, #blocked1> 2026-02-21T08:47:56.5159305Z %63 = tt.load %62 : tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T08:47:56.5159569Z %64 = ttg.convert_layout %63 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.5159971Z %65 = arith.extf %64 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.5160260Z %66 = arith.extsi %arg3 : i32 to i64 2026-02-21T08:47:56.5160427Z %67 = tt.splat %66 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.5160647Z %68 = arith.addi %67, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.5160914Z %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5161157Z %70 = arith.muli %69, %cst_4 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5161342Z %71 = tt.broadcast %70 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T08:47:56.5161527Z %72 = arith.addi %71, %34 : tensor<2x64xi64, 
#blocked2> 2026-02-21T08:47:56.5161722Z %73 = tt.addptr %27, %72 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T08:47:56.5161956Z %74 = arith.cmpi sge, %69, %cst_5 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5162126Z %75 = arith.cmpi slt, %69, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5162289Z %76 = arith.andi %74, %75 : tensor<2x1xi1, #blocked2> 2026-02-21T08:47:56.5162468Z %77 = tt.broadcast %76 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T08:47:56.5162729Z %78 = arith.andi %77, %38 : tensor<2x64xi1, #blocked2> 2026-02-21T08:47:56.5162889Z %79 = tt.load %73, %78, %cst_9 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T08:47:56.5163141Z %80 = ttg.convert_layout %79 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5163418Z %81 = arith.shli %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5163655Z %82 = arith.shrsi %81, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5163890Z %83 = arith.shrsi %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5164170Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.5164498Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.5164777Z %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5165008Z %87 = arith.select %43, %86, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5165244Z %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5165512Z %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T08:47:56.5165733Z %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T08:47:56.5165951Z %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T08:47:56.5166189Z %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem> 2026-02-21T08:47:56.5166509Z %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.5166972Z %94 = tt.dot %65, %93, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T08:47:56.5167315Z %95 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T08:47:56.5167438Z %96 = arith.muli %95, %c2_i32 : i32 2026-02-21T08:47:56.5167607Z %97 = tt.splat %96 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.5167824Z %98 = arith.addi %97, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:56.5168092Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:47:56.5168363Z %100 = tt.broadcast %99 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1> 2026-02-21T08:47:56.5168558Z %101 = arith.addi %24, %100 : tensor<16x4xi32, #blocked1> 2026-02-21T08:47:56.5168757Z %102 = tt.addptr %25, %101 : tensor<16x4x!tt.ptr, #blocked1>, tensor<16x4xi32, #blocked1> 2026-02-21T08:47:56.5168966Z %103 = tt.load %102 : tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T08:47:56.5169230Z %104 = ttg.convert_layout %103 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.5169632Z %105 = arith.extf %104 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.5169955Z %106 = arith.extsi %95 : i32 to i64 2026-02-21T08:47:56.5170124Z %107 = tt.splat %106 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.5170348Z %108 = arith.addi %107, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:56.5170625Z %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5170877Z %110 = arith.muli %109, %cst_4 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5171070Z %111 = tt.broadcast %110 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T08:47:56.5171260Z %112 = arith.addi %111, %34 : tensor<2x64xi64, #blocked2> 2026-02-21T08:47:56.5171463Z %113 = tt.addptr %27, %112 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T08:47:56.5171671Z %114 = arith.cmpi sge, %109, %cst_5 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5171849Z %115 = arith.cmpi slt, %109, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:56.5172012Z %116 = arith.andi %114, %115 : tensor<2x1xi1, #blocked2> 2026-02-21T08:47:56.5172199Z %117 = tt.broadcast %116 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T08:47:56.5172391Z %118 = arith.andi %117, %38 : tensor<2x64xi1, #blocked2> 2026-02-21T08:47:56.5172557Z %119 = tt.load %113, %118, %cst_9 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T08:47:56.5172817Z %120 = ttg.convert_layout %119 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5173095Z %121 = arith.shli %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5173367Z %122 = arith.shrsi %121, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:56.5173606Z %123 = arith.shrsi %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 
1, parent = #blocked}>> 2026-02-21T08:47:56.5173895Z %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.5174226Z %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:56.5174501Z %126 = tt.broadcast %124 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5174739Z %127 = arith.select %43, %126, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5174973Z %128 = tt.broadcast %125 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5175200Z %129 = arith.select %45, %128, %127 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:56.5175431Z %130 = tt.reshape %129 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T08:47:56.5175650Z %131 = arith.sitofp %130 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T08:47:56.5175896Z %132 = ttg.local_alloc %131 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem> 2026-02-21T08:47:56.5176214Z %133 = ttg.local_load %132 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:56.5176672Z %134 = tt.dot %105, %133, %94, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T08:47:56.5177016Z scf.yield %134 : tensor<16x64xf32, #mma> 2026-02-21T08:47:56.5177140Z } {tt.num_stages = 1 : i32} 2026-02-21T08:47:56.5177294Z %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T08:47:56.5177549Z %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 
2026-02-21T08:47:56.5177812Z %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma> 2026-02-21T08:47:56.5178033Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T08:47:56.5178278Z %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T08:47:56.5178471Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T08:47:56.5178638Z %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma> 2026-02-21T08:47:56.5178806Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T08:47:56.5179010Z %55 = tt.addptr %54, %53 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T08:47:56.5179195Z tt.store %55, %47 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T08:47:56.5179324Z tt.return 2026-02-21T08:47:56.5179404Z } 2026-02-21T08:47:56.5179480Z } 2026-02-21T08:47:56.5179526Z 2026-02-21T08:47:56.5179556Z {-# 2026-02-21T08:47:56.5179640Z external_resources: { 2026-02-21T08:47:56.5179736Z mlir_reproducer: { 2026-02-21T08:47:56.5180747Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:47:56.5181758Z disable_threading: false, 2026-02-21T08:47:56.5181899Z verify_each: true 2026-02-21T08:47:56.5181986Z } 2026-02-21T08:47:56.5182060Z } 2026-02-21T08:47:56.5182127Z #-} 2026-02-21T08:47:56.5182407Z 
/tmp/torchinductor_root/2d/c2d2fvrviaycdfpxcopjx3txsy2pc2p2sbbhp3w33hnxcsxbupf7.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:47:56.5183088Z /tmp/torchinductor_root/2d/c2d2fvrviaycdfpxcopjx3txsy2pc2p2sbbhp3w33hnxcsxbupf7.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:47:56.5183633Z [114s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:47:56.5184356Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:47:56.5185009Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:47:56.5185173Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:47:57.1794942Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:47:57.1797083Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T08:47:57.1798488Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:47:57.1799607Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:47:57.1800438Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:47:57.1801486Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:47:57.1801993Z #smem = #ttg.shared_memory 2026-02-21T08:47:57.1803021Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:47:57.1804536Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:47:57.1805458Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T08:47:57.1806044Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:57.1806469Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:57.1806890Z %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T08:47:57.1807287Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:47:57.1807625Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T08:47:57.1807938Z %cst_4 = arith.constant dense<7168> : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1808221Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1808506Z %cst_6 = arith.constant dense<4096> : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1808792Z %cst_7 = 
arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:57.1809078Z %cst_8 = arith.constant dense<7168> : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:57.1809377Z %cst_9 = arith.constant dense<0> : tensor<2x64xi8, #blocked2> 2026-02-21T08:47:57.1809619Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:47:57.1809803Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:47:57.1810183Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:47:57.1810410Z %cst_10 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1810662Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T08:47:57.1810863Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:47:57.1811055Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:47:57.1811353Z %cst_11 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1811674Z %0 = tt.get_program_id x : i32 2026-02-21T08:47:57.1811860Z %1 = arith.divsi %0, %c7168_i32 : i32 2026-02-21T08:47:57.1812047Z %2 = arith.muli %1, %c64_i32 : i32 2026-02-21T08:47:57.1812234Z %3 = arith.subi %c4_i32, %2 : i32 2026-02-21T08:47:57.1812411Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T08:47:57.1812607Z %5 = arith.remsi %0, %c7168_i32 : i32 2026-02-21T08:47:57.1812790Z %6 = arith.remsi %5, %4 : i32 2026-02-21T08:47:57.1812968Z %7 = arith.addi %2, %6 : i32 2026-02-21T08:47:57.1813145Z %8 = arith.divsi %5, %4 : i32 2026-02-21T08:47:57.1813321Z %9 = arith.muli %7, %c16_i32 : i32 2026-02-21T08:47:57.1813656Z %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:57.1814111Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:47:57.1814515Z %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:57.1814865Z %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T08:47:57.1815212Z %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:47:57.1815554Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:47:57.1815818Z %16 = arith.muli %8, %c64_i32 : i32 2026-02-21T08:47:57.1816142Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:57.1816580Z %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:57.1817048Z %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:57.1817377Z %20 = arith.addi %19, %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:47:57.1817707Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:57.1818086Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T08:47:57.1818392Z %23 = arith.muli %22, %cst_2 : tensor<16x1xi32, #blocked1> 2026-02-21T08:47:57.1818623Z %24 = tt.broadcast %23 : tensor<16x1xi32, #blocked1> -> tensor<16x4xi32, #blocked1> 2026-02-21T08:47:57.1818897Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T08:47:57.1819092Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T08:47:57.1819282Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T08:47:57.1819568Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:57.1819960Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:57.1820323Z %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T08:47:57.1820704Z %31 = arith.extsi %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:57.1821063Z %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:47:57.1821443Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T08:47:57.1821781Z %34 = tt.broadcast %33 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T08:47:57.1822029Z %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:57.1822240Z %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x64xi64, #blocked2> 2026-02-21T08:47:57.1822442Z %37 = arith.andi %35, %36 : tensor<1x64xi1, #blocked2> 2026-02-21T08:47:57.1822663Z %38 = tt.broadcast %37 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T08:47:57.1823012Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:47:57.1823520Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:47:57.1824016Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:57.1824332Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:57.1824571Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T08:47:57.1824809Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:47:57.1825047Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T08:47:57.1825376Z %46 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 
iter_args(%arg4 = %cst_3) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T08:47:57.1825647Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:47:57.1825866Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:57.1826136Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:57.1826480Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:47:57.1844349Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1> 2026-02-21T08:47:57.1844542Z %61 = arith.addi %24, %60 : tensor<16x4xi32, #blocked1> 2026-02-21T08:47:57.1844743Z %62 = tt.addptr %25, %61 : tensor<16x4x!tt.ptr, #blocked1>, tensor<16x4xi32, #blocked1> 2026-02-21T08:47:57.1844945Z %63 = tt.load %62 : tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T08:47:57.1845213Z %64 = ttg.convert_layout %63 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:57.1845611Z %65 = arith.extf %64 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:57.1845895Z %66 = arith.extsi %arg3 : i32 to i64 2026-02-21T08:47:57.1846065Z %67 = tt.splat %66 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:57.1846283Z %68 = arith.addi %67, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:57.1846553Z %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1846794Z %70 = arith.muli %69, %cst_4 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1846984Z %71 = tt.broadcast %70 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T08:47:57.1847169Z %72 = arith.addi %71, %34 : tensor<2x64xi64, 
#blocked2> 2026-02-21T08:47:57.1847359Z %73 = tt.addptr %27, %72 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T08:47:57.1847563Z %74 = arith.cmpi sge, %69, %cst_5 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1847778Z %75 = arith.cmpi slt, %69, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1847941Z %76 = arith.andi %74, %75 : tensor<2x1xi1, #blocked2> 2026-02-21T08:47:57.1848124Z %77 = tt.broadcast %76 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T08:47:57.1848307Z %78 = arith.andi %77, %38 : tensor<2x64xi1, #blocked2> 2026-02-21T08:47:57.1848472Z %79 = tt.load %73, %78, %cst_9 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T08:47:57.1848719Z %80 = ttg.convert_layout %79 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1849000Z %81 = arith.shli %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1849230Z %82 = arith.shrsi %81, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1849460Z %83 = arith.shrsi %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1849743Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:57.1850072Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:57.1850350Z %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1850585Z %87 = arith.select %43, %86, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1850821Z %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1851049Z %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T08:47:57.1851265Z %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T08:47:57.1851485Z %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T08:47:57.1851728Z %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem> 2026-02-21T08:47:57.1852083Z %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:57.1852548Z %94 = tt.dot %65, %93, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T08:47:57.1852898Z %95 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T08:47:57.1853021Z %96 = arith.muli %95, %c2_i32 : i32 2026-02-21T08:47:57.1853188Z %97 = tt.splat %96 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:57.1853402Z %98 = arith.addi %97, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:47:57.1853677Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:47:57.1853946Z %100 = tt.broadcast %99 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1> 2026-02-21T08:47:57.1854144Z %101 = arith.addi %24, %100 : tensor<16x4xi32, #blocked1> 2026-02-21T08:47:57.1854349Z %102 = tt.addptr %25, %101 : tensor<16x4x!tt.ptr, #blocked1>, tensor<16x4xi32, #blocked1> 2026-02-21T08:47:57.1854552Z %103 = tt.load %102 : tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T08:47:57.1854818Z %104 = ttg.convert_layout %103 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:57.1855214Z %105 = arith.extf %104 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:57.1855495Z %106 = arith.extsi %95 : i32 to i64 2026-02-21T08:47:57.1855710Z %107 = tt.splat %106 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:57.1855931Z %108 = arith.addi %107, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:47:57.1856214Z %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1856462Z %110 = arith.muli %109, %cst_4 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1856656Z %111 = tt.broadcast %110 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T08:47:57.1856846Z %112 = arith.addi %111, %34 : tensor<2x64xi64, #blocked2> 2026-02-21T08:47:57.1857044Z %113 = tt.addptr %27, %112 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T08:47:57.1857258Z %114 = arith.cmpi sge, %109, %cst_5 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1857431Z %115 = arith.cmpi slt, %109, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T08:47:57.1857602Z %116 = arith.andi %114, %115 : tensor<2x1xi1, #blocked2> 2026-02-21T08:47:57.1857784Z %117 = tt.broadcast %116 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T08:47:57.1857976Z %118 = arith.andi %117, %38 : tensor<2x64xi1, #blocked2> 2026-02-21T08:47:57.1858146Z %119 = tt.load %113, %118, %cst_9 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T08:47:57.1858399Z %120 = ttg.convert_layout %119 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1858683Z %121 = arith.shli %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1858924Z %122 = arith.shrsi %121, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:47:57.1859166Z %123 = arith.shrsi %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 
1, parent = #blocked}>> 2026-02-21T08:47:57.1859461Z %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:57.1859795Z %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T08:47:57.1860111Z %126 = tt.broadcast %124 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1860349Z %127 = arith.select %43, %126, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1860590Z %128 = tt.broadcast %125 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1860823Z %129 = arith.select %45, %128, %127 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T08:47:57.1861049Z %130 = tt.reshape %129 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T08:47:57.1861270Z %131 = arith.sitofp %130 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T08:47:57.1861517Z %132 = ttg.local_alloc %131 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem> 2026-02-21T08:47:57.1861843Z %133 = ttg.local_load %132 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:47:57.1862316Z %134 = tt.dot %105, %133, %94, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T08:47:57.1862660Z scf.yield %134 : tensor<16x64xf32, #mma> 2026-02-21T08:47:57.1862793Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:47:57.1862955Z %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T08:47:57.1863214Z %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> 
tensor<16x1xi32, #mma> 2026-02-21T08:47:57.1863451Z %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma> 2026-02-21T08:47:57.1863701Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T08:47:57.1863954Z %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T08:47:57.1864148Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T08:47:57.1864325Z %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma> 2026-02-21T08:47:57.1864496Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T08:47:57.1864699Z %55 = tt.addptr %54, %53 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T08:47:57.1864888Z tt.store %55, %47 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T08:47:57.1865017Z tt.return 2026-02-21T08:47:57.1865103Z } 2026-02-21T08:47:57.1865180Z } 2026-02-21T08:47:57.1865229Z 2026-02-21T08:47:57.1865261Z {-# 2026-02-21T08:47:57.1865342Z external_resources: { 2026-02-21T08:47:57.1865447Z mlir_reproducer: { 2026-02-21T08:47:57.1866446Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:47:57.1867445Z disable_threading: false, 2026-02-21T08:47:57.1867552Z verify_each: true 2026-02-21T08:47:57.1867646Z } 2026-02-21T08:47:57.1867720Z } 2026-02-21T08:47:57.1867794Z #-} 2026-02-21T08:47:57.1868073Z 
/tmp/torchinductor_root/45/c453xlk5xasye6ijooj4jwaykdpsr4scc47yhy2o7agxwzgm2aq3.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:47:57.1868758Z /tmp/torchinductor_root/45/c453xlk5xasye6ijooj4jwaykdpsr4scc47yhy2o7agxwzgm2aq3.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:47:57.1869342Z [115s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:47:57.1870066Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:47:57.1870726Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:47:57.1870901Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:47:57.4770312Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 109/109 15.8 configs/s
2026-02-21T08:47:59.3998588Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 431.4
2026-02-21T08:47:59.3998997Z configs/s
2026-02-21T08:47:59.9489531Z [118s] Generation 2 complete:
2026-02-21T08:47:59.9489762Z error=6
2026-02-21T08:47:59.9489903Z ok=105
2026-02-21T08:47:59.9490039Z min=0.1224
2026-02-21T08:47:59.9490181Z mid=0.3420
2026-02-21T08:47:59.9490316Z max=14.9881
2026-02-21T08:47:59.9490475Z best={'block_sizes': [64, 32, 32],
2026-02-21T08:47:59.9490732Z 'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:47:59.9490976Z 'l2_groupings': [8],
2026-02-21T08:47:59.9491163Z 'load_eviction_policies': ['', ''],
2026-02-21T08:47:59.9491401Z 'loop_orders': [[1, 0]],
2026-02-21T08:47:59.9491945Z 'matrix_instr_nonkdim': 16,
2026-02-21T08:47:59.9492156Z 'num_sm_multiplier': 8,
2026-02-21T08:47:59.9492335Z 'num_stages': 2,
2026-02-21T08:47:59.9492515Z 'num_warps': 4,
2026-02-21T08:47:59.9492694Z 'pid_type': 'persistent_interleaved',
2026-02-21T08:47:59.9492909Z 'range_flattens': [None, True],
2026-02-21T08:47:59.9493114Z 'range_multi_buffers': [None, None],
2026-02-21T08:47:59.9493317Z 'range_num_stages': [0, 3],
2026-02-21T08:47:59.9493506Z 'range_unroll_factors': [1, 4],
2026-02-21T08:47:59.9493704Z 'range_warp_specializes': [],
2026-02-21T08:47:59.9493888Z 'waves_per_eu': 3}
2026-02-21T08:47:59.9644479Z [118s] Fitting surrogate: 315 points, 315 targets
2026-02-21T08:48:00.9612186Z [119s] Generation 3 starting: 92 neighbors, 5 active search path(s)
2026-02-21T08:48:19.2611418Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 49.3 configs/s
2026-02-21T08:48:25.1523471Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 16.1 configs/s
2026-02-21T08:48:29.8853087Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 196.8
2026-02-21T08:48:29.8853708Z configs/s
2026-02-21T08:48:30.4247871Z [148s] Generation 3 complete:
2026-02-21T08:48:30.4248283Z ok=98
2026-02-21T08:48:30.4248547Z min=0.1082
2026-02-21T08:48:30.4248763Z mid=0.2042
2026-02-21T08:48:30.4248974Z max=10.4375
2026-02-21T08:48:30.4249213Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:48:30.4249585Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:48:30.4249936Z 'l2_groupings': [1],
2026-02-21T08:48:30.4250212Z 'load_eviction_policies': ['', ''],
2026-02-21T08:48:30.4250525Z 'loop_orders': [[0, 1]],
2026-02-21T08:48:30.4250800Z 'matrix_instr_nonkdim': 0,
2026-02-21T08:48:30.4251065Z 'num_stages': 1,
2026-02-21T08:48:30.4251296Z 'num_warps': 2,
2026-02-21T08:48:30.4251523Z 'pid_type': 'flat',
2026-02-21T08:48:30.4251790Z 'range_flattens': [None, None],
2026-02-21T08:48:30.4252135Z 'range_multi_buffers': [None, None],
2026-02-21T08:48:30.4252446Z 'range_num_stages': [0, 0],
2026-02-21T08:48:30.4253306Z 'range_unroll_factors': [0, 0],
2026-02-21T08:48:30.4253598Z 'range_warp_specializes': [],
2026-02-21T08:48:30.4253872Z 'waves_per_eu': 1}
2026-02-21T08:48:30.4871268Z [148s] Fitting surrogate: 413 points, 413 targets
2026-02-21T08:48:31.4260467Z [149s] Generation 4 starting: 88 neighbors, 5 active search path(s)
2026-02-21T08:48:49.5628951Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 1.4 configs/s
2026-02-21T08:48:54.7889464Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 17.7 configs/s
2026-02-21T08:49:03.6276752Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 120.6
2026-02-21T08:49:03.6278751Z configs/s
2026-02-21T08:49:04.3557720Z [182s] Generation 4 complete:
2026-02-21T08:49:04.3558161Z error=7
2026-02-21T08:49:04.3558364Z ok=86
2026-02-21T08:49:04.3558634Z min=0.1070
2026-02-21T08:49:04.3558841Z mid=0.1488
2026-02-21T08:49:04.3559043Z max=10.2584
2026-02-21T08:49:04.3559308Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:49:04.3559682Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:49:04.3560030Z 
'l2_groupings': [1], 2026-02-21T08:49:04.3560309Z 'load_eviction_policies': ['', ''], 2026-02-21T08:49:04.3560607Z 'loop_orders': [[0, 1]], 2026-02-21T08:49:04.3560850Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:49:04.3561083Z 'num_stages': 1, 2026-02-21T08:49:04.3561294Z 'num_warps': 2, 2026-02-21T08:49:04.3561496Z 'pid_type': 'flat', 2026-02-21T08:49:04.3561719Z 'range_flattens': [None, None], 2026-02-21T08:49:04.3561986Z 'range_multi_buffers': [None, None], 2026-02-21T08:49:04.3562256Z 'range_num_stages': [0, 0], 2026-02-21T08:49:04.3562497Z 'range_unroll_factors': [0, 0], 2026-02-21T08:49:04.3562845Z 'range_warp_specializes': [], 2026-02-21T08:49:04.3563091Z 'waves_per_eu': 1} 2026-02-21T08:49:04.4526449Z [182s] Fitting surrogate: 506 points, 506 targets 2026-02-21T08:49:05.3038930Z [183s] Generation 5 starting: 81 neighbors, 5 active search path(s) 2026-02-21T08:49:34.5568580Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.2 configs/s 2026-02-21T08:49:38.8590302Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:47:25: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T08:49:38.8591656Z b_tile = tl.load(tl.make_block_ptr(B, [4096, 7168], [7168, 1], [offset_3, offset_2], [_BLOCK_SIZE_0, _BLOCK_SIZE_2], [1, 0]), boundary_check=[0, 1], padding_option='zero') 2026-02-21T08:49:38.8592430Z ^ 2026-02-21T08:49:38.8594317Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:60:41: note: - use: %112 = "ttg.convert_layout"(<>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:49:38.8596462Z 2026-02-21T08:49:38.8596605Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T08:49:38.8596868Z ^ 2026-02-21T08:49:38.8598241Z 
/tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:59:41: note: - use: %113 = "ttg.convert_layout"(<>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:49:38.8599541Z 2026-02-21T08:49:38.8599632Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T08:49:38.8599877Z ^ 2026-02-21T08:49:38.8600156Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T08:49:38.8600652Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:49:38.8601302Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:49:38.8601962Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:49:38.8602746Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:49:38.8603368Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T08:49:38.8603980Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T08:49:38.8604613Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T08:49:38.8605230Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T08:49:38.8605855Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:49:38.8606492Z #blocked9 = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:49:38.8607003Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:49:38.8607481Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:49:38.8607939Z #blocked12 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [8, 8], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:49:38.8608638Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:49:38.8609341Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:49:38.8609908Z %cst = arith.constant dense<0> : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8610186Z %cst_0 = arith.constant dense<7168> : tensor<1x16xi64, #blocked> 2026-02-21T08:49:38.8610446Z %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T08:49:38.8610705Z %cst_2 = arith.constant dense<4096> : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8610970Z %cst_3 = arith.constant dense<0> : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8611228Z %cst_4 = arith.constant dense<7168> : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8611494Z %cst_5 = arith.constant dense<0> : tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8611726Z %c28672_i32 = arith.constant 28672 : i32 2026-02-21T08:49:38.8611910Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:49:38.8612140Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:49:38.8612353Z %cst_6 = arith.constant dense<7168> : tensor<32x1xi32, #blocked1> 2026-02-21T08:49:38.8612617Z %cst_7 = arith.constant dense<1> : 
tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:38.8612874Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:38.8613129Z %cst_9 = arith.constant dense<4> : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8613393Z %cst_10 = arith.constant dense<8192> : tensor<32x1xi32, #blocked1> 2026-02-21T08:49:38.8613617Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:49:38.8613835Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #blocked> 2026-02-21T08:49:38.8614081Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:49:38.8614246Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:49:38.8614416Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:49:38.8614593Z %0 = tt.get_program_id x : i32 2026-02-21T08:49:38.8614755Z %1 = arith.divsi %0, %c28672_i32 : i32 2026-02-21T08:49:38.8614930Z %2 = arith.muli %1, %c64_i32 : i32 2026-02-21T08:49:38.8615099Z %3 = arith.subi %c2_i32, %2 : i32 2026-02-21T08:49:38.8615262Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T08:49:38.8615426Z %5 = arith.remsi %0, %c28672_i32 : i32 2026-02-21T08:49:38.8615603Z %6 = arith.remsi %5, %4 : i32 2026-02-21T08:49:38.8615770Z %7 = arith.addi %2, %6 : i32 2026-02-21T08:49:38.8615925Z %8 = arith.divsi %5, %4 : i32 2026-02-21T08:49:38.8616085Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T08:49:38.8616318Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked4> 2026-02-21T08:49:38.8616635Z %11 = tt.splat %9 : i32 -> tensor<32xi32, #blocked4> 2026-02-21T08:49:38.8616824Z %12 = arith.addi %11, %10 : tensor<32xi32, #blocked4> 2026-02-21T08:49:38.8616978Z %13 = arith.muli %8, %c16_i32 : i32 2026-02-21T08:49:38.8617159Z %14 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4> 2026-02-21T08:49:38.8617367Z %15 = tt.splat %13 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:49:38.8617566Z %16 = arith.addi %15, %14 : tensor<16xi32, #blocked4> 2026-02-21T08:49:38.8617771Z %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32, #blocked4> 2026-02-21T08:49:38.8618081Z %18 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:38.8618461Z %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5> 2026-02-21T08:49:38.8618799Z %20 = ttg.convert_layout %19 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:49:38.8619030Z %21 = arith.muli %20, %cst_10 : tensor<32x1xi32, #blocked1> 2026-02-21T08:49:38.8619299Z %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x128xi32, #blocked1> 2026-02-21T08:49:38.8619568Z %23 = ttg.convert_layout %22 : tensor<32x128xi32, #blocked1> -> tensor<32x128xi32, #blocked6> 2026-02-21T08:49:38.8619841Z %24 = tt.splat %arg0 : !tt.ptr -> tensor<32x128x!tt.ptr, #blocked6> 2026-02-21T08:49:38.8620037Z %25 = arith.extsi %13 : i32 to i64 2026-02-21T08:49:38.8620205Z %26 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr, #blocked> 2026-02-21T08:49:38.8620440Z %27 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked4> 2026-02-21T08:49:38.8620679Z %28 = arith.extsi %27 : tensor<64xi32, #blocked4> to tensor<64xi64, #blocked4> 2026-02-21T08:49:38.8620888Z %29 = tt.splat %25 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:49:38.8621081Z %30 = arith.extsi %14 : tensor<16xi32, #blocked4> to tensor<16xi64, #blocked4> 2026-02-21T08:49:38.8621283Z %31 = arith.addi %29, %30 : tensor<16xi64, #blocked4> 2026-02-21T08:49:38.8621555Z %32 = ttg.convert_layout %31 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:38.8621971Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi64, #blocked7> 2026-02-21T08:49:38.8622292Z %34 = ttg.convert_layout %33 : tensor<1x16xi64, #blocked7> -> tensor<1x16xi64, #blocked> 2026-02-21T08:49:38.8622549Z %35 = 
tt.broadcast %34 : tensor<1x16xi64, #blocked> -> tensor<64x16xi64, #blocked> 2026-02-21T08:49:38.8622777Z %36 = arith.cmpi sge, %34, %cst_1 : tensor<1x16xi64, #blocked> 2026-02-21T08:49:38.8622968Z %37 = arith.cmpi slt, %34, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T08:49:38.8623151Z %38 = arith.andi %36, %37 : tensor<1x16xi1, #blocked> 2026-02-21T08:49:38.8623354Z %39 = tt.broadcast %38 : tensor<1x16xi1, #blocked> -> tensor<64x16xi1, #blocked> 2026-02-21T08:49:38.8623587Z %40 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4> 2026-02-21T08:49:38.8623915Z %41 = ttg.convert_layout %40 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:38.8624285Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x2xi32, #blocked7> 2026-02-21T08:49:38.8624607Z %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #blocked8> 2026-02-21T08:49:38.8624934Z %44 = ttg.convert_layout %43 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T08:49:38.8625309Z %45 = tt.expand_dims %44 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x2x1xi32, #blocked9> 2026-02-21T08:49:38.8625655Z %46 = ttg.convert_layout %45 : tensor<1x2x1xi32, #blocked9> -> tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:38.8625894Z %47 = arith.cmpi eq, %46, %cst_8 : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:38.8626124Z %48 = tt.broadcast %47 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3> 2026-02-21T08:49:38.8626395Z %49 = ttg.convert_layout %48 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2> 2026-02-21T08:49:38.8626620Z %50 = arith.cmpi eq, %46, %cst_7 : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:38.8626813Z %51 = tt.broadcast %50 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3> 2026-02-21T08:49:38.8627040Z %52 = 
ttg.convert_layout %51 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2> 2026-02-21T08:49:38.8627224Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:49:38.8627446Z %53 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_11) -> (tensor<32x16xf32, #blocked>) : i32 { 2026-02-21T08:49:38.8627665Z %68 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:49:38.8627802Z %69 = tt.splat %68 : i32 -> tensor<128xi32, #blocked4> 2026-02-21T08:49:38.8627951Z %70 = arith.addi %69, %17 : tensor<128xi32, #blocked4> 2026-02-21T08:49:38.8628219Z %71 = ttg.convert_layout %70 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:38.8628545Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T08:49:38.8628830Z %73 = ttg.convert_layout %72 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6> 2026-02-21T08:49:38.8629063Z %74 = tt.broadcast %73 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6> 2026-02-21T08:49:38.8629253Z %75 = arith.addi %23, %74 : tensor<32x128xi32, #blocked6> 2026-02-21T08:49:38.8629453Z %76 = tt.addptr %24, %75 : tensor<32x128x!tt.ptr, #blocked6>, tensor<32x128xi32, #blocked6> 2026-02-21T08:49:38.8629659Z %77 = tt.load %76 : tensor<32x128x!tt.ptr, #blocked6> 2026-02-21T08:49:38.8629848Z %78 = arith.extf %77 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6> 2026-02-21T08:49:38.8630025Z %79 = arith.extsi %arg3 : i32 to i64 2026-02-21T08:49:38.8630156Z %80 = tt.splat %79 : i64 -> tensor<64xi64, #blocked4> 2026-02-21T08:49:38.8630340Z %81 = arith.addi %80, %28 : tensor<64xi64, #blocked4> 2026-02-21T08:49:38.8630569Z %82 = ttg.convert_layout %81 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:38.8630889Z %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = 
#blocked5}>> -> tensor<64x1xi64, #blocked5> 2026-02-21T08:49:38.8631167Z %84 = ttg.convert_layout %83 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8631364Z %85 = arith.muli %84, %cst_4 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8631551Z %86 = tt.broadcast %85 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1> 2026-02-21T08:49:38.8631776Z %87 = ttg.convert_layout %86 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked> 2026-02-21T08:49:38.8631975Z %88 = arith.addi %87, %35 : tensor<64x16xi64, #blocked> 2026-02-21T08:49:38.8632166Z %89 = tt.addptr %26, %88 : tensor<64x16x!tt.ptr, #blocked>, tensor<64x16xi64, #blocked> 2026-02-21T08:49:38.8632368Z %90 = arith.cmpi sge, %84, %cst_3 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8632536Z %91 = arith.cmpi slt, %84, %cst_2 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8632691Z %92 = arith.andi %90, %91 : tensor<64x1xi1, #blocked1> 2026-02-21T08:49:38.8632870Z %93 = tt.broadcast %92 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1> 2026-02-21T08:49:38.8633093Z %94 = ttg.convert_layout %93 : tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked> 2026-02-21T08:49:38.8633282Z %95 = arith.andi %94, %39 : tensor<64x16xi1, #blocked> 2026-02-21T08:49:38.8633439Z %96 = tt.load %89, %95, %cst : tensor<64x16x!tt.ptr, #blocked> 2026-02-21T08:49:38.8633603Z %97 = arith.shli %96, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8633758Z %98 = arith.shrsi %97, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8633915Z %99 = arith.shrsi %96, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8634157Z %100 = ttg.convert_layout %98 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:38.8634503Z %101 = tt.expand_dims %100 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:38.8634811Z %102 = 
ttg.convert_layout %101 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:38.8635104Z %103 = ttg.convert_layout %99 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:38.8635442Z %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:38.8635795Z %105 = ttg.convert_layout %104 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:38.8636048Z %106 = tt.broadcast %102 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:38.8636292Z %107 = ttg.convert_layout %106 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8636547Z %108 = arith.select %49, %107, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8636793Z %109 = tt.broadcast %105 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:38.8637037Z %110 = ttg.convert_layout %109 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8637292Z %111 = arith.select %52, %110, %108 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8637527Z %112 = tt.reshape %111 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked> 2026-02-21T08:49:38.8637759Z %113 = arith.sitofp %112 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked> 2026-02-21T08:49:38.8638084Z %114 = ttg.convert_layout %78 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> 2026-02-21T08:49:38.8638431Z %115 = ttg.convert_layout %113 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> 2026-02-21T08:49:38.8638731Z %116 = ttg.convert_layout %arg4 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked12> 2026-02-21T08:49:38.8639140Z 
%117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<32x16xf32, #blocked12> 2026-02-21T08:49:38.8639547Z %118 = ttg.convert_layout %117 : tensor<32x16xf32, #blocked12> -> tensor<32x16xf32, #blocked> 2026-02-21T08:49:38.8639740Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:49:38.8639862Z %119 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T08:49:38.8639984Z %120 = arith.addi %arg3, %119 : i32 2026-02-21T08:49:38.8640101Z %121 = arith.muli %120, %c2_i32 : i32 2026-02-21T08:49:38.8640240Z %122 = tt.splat %121 : i32 -> tensor<128xi32, #blocked4> 2026-02-21T08:49:38.8640395Z %123 = arith.addi %122, %17 : tensor<128xi32, #blocked4> 2026-02-21T08:49:38.8640635Z %124 = ttg.convert_layout %123 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:38.8640972Z %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T08:49:38.8641267Z %126 = ttg.convert_layout %125 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6> 2026-02-21T08:49:38.8641506Z %127 = tt.broadcast %126 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6> 2026-02-21T08:49:38.8641705Z %128 = arith.addi %23, %127 : tensor<32x128xi32, #blocked6> 2026-02-21T08:49:38.8641912Z %129 = tt.addptr %24, %128 : tensor<32x128x!tt.ptr, #blocked6>, tensor<32x128xi32, #blocked6> 2026-02-21T08:49:38.8642125Z %130 = tt.load %129 : tensor<32x128x!tt.ptr, #blocked6> 2026-02-21T08:49:38.8642319Z %131 = arith.extf %130 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6> 2026-02-21T08:49:38.8642496Z %132 = arith.extsi %120 : i32 to i64 2026-02-21T08:49:38.8642663Z %133 = tt.splat %132 : i64 -> tensor<64xi64, #blocked4> 2026-02-21T08:49:38.8642815Z %134 = arith.addi %133, %28 : tensor<64xi64, #blocked4> 
2026-02-21T08:49:38.8643049Z %135 = ttg.convert_layout %134 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:38.8643370Z %136 = tt.expand_dims %135 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi64, #blocked5> 2026-02-21T08:49:38.8643693Z %137 = ttg.convert_layout %136 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8643897Z %138 = arith.muli %137, %cst_4 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8644093Z %139 = tt.broadcast %138 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1> 2026-02-21T08:49:38.8644326Z %140 = ttg.convert_layout %139 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked> 2026-02-21T08:49:38.8644525Z %141 = arith.addi %140, %35 : tensor<64x16xi64, #blocked> 2026-02-21T08:49:38.8644721Z %142 = tt.addptr %26, %141 : tensor<64x16x!tt.ptr, #blocked>, tensor<64x16xi64, #blocked> 2026-02-21T08:49:38.8644927Z %143 = arith.cmpi sge, %137, %cst_3 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8645104Z %144 = arith.cmpi slt, %137, %cst_2 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:38.8645267Z %145 = arith.andi %143, %144 : tensor<64x1xi1, #blocked1> 2026-02-21T08:49:38.8645459Z %146 = tt.broadcast %145 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1> 2026-02-21T08:49:38.8645691Z %147 = ttg.convert_layout %146 : tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked> 2026-02-21T08:49:38.8645923Z %148 = arith.andi %147, %39 : tensor<64x16xi1, #blocked> 2026-02-21T08:49:38.8646091Z %149 = tt.load %142, %148, %cst : tensor<64x16x!tt.ptr, #blocked> 2026-02-21T08:49:38.8646263Z %150 = arith.shli %149, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8646428Z %151 = arith.shrsi %150, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8646591Z %152 = arith.shrsi %149, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:38.8646840Z %153 = ttg.convert_layout %151 : 
tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:38.8647188Z %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:38.8647495Z %155 = ttg.convert_layout %154 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:38.8647791Z %156 = ttg.convert_layout %152 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:38.8648131Z %157 = tt.expand_dims %156 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:38.8648435Z %158 = ttg.convert_layout %157 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:38.8648684Z %159 = tt.broadcast %155 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:38.8648928Z %160 = ttg.convert_layout %159 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8649182Z %161 = arith.select %49, %160, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8649432Z %162 = tt.broadcast %158 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:38.8649679Z %163 = ttg.convert_layout %162 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8649929Z %164 = arith.select %52, %163, %161 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:38.8650172Z %165 = tt.reshape %164 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked> 2026-02-21T08:49:38.8650403Z %166 = arith.sitofp %165 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked> 2026-02-21T08:49:38.8650695Z %167 = ttg.convert_layout %131 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> 
2026-02-21T08:49:38.8651046Z %168 = ttg.convert_layout %166 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> 2026-02-21T08:49:38.8651377Z %169 = ttg.convert_layout %118 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked12> 2026-02-21T08:49:38.8651779Z %170 = tt.dot %167, %168, %169, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<32x16xf32, #blocked12> 2026-02-21T08:49:38.8652178Z %171 = ttg.convert_layout %170 : tensor<32x16xf32, #blocked12> -> tensor<32x16xf32, #blocked> 2026-02-21T08:49:38.8652374Z scf.yield %171 : tensor<32x16xf32, #blocked> 2026-02-21T08:49:38.8652495Z } {tt.flatten} 2026-02-21T08:49:38.8652637Z %54 = arith.truncf %53 : tensor<32x16xf32, #blocked> to tensor<32x16xbf16, #blocked> 2026-02-21T08:49:38.8652905Z %55 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:38.8653227Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5> 2026-02-21T08:49:38.8653512Z %57 = ttg.convert_layout %56 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:49:38.8653711Z %58 = arith.muli %57, %cst_6 : tensor<32x1xi32, #blocked1> 2026-02-21T08:49:38.8653974Z %59 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:38.8654290Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi32, #blocked7> 2026-02-21T08:49:38.8654566Z %61 = ttg.convert_layout %60 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked> 2026-02-21T08:49:38.8654789Z %62 = tt.broadcast %58 : tensor<32x1xi32, #blocked1> -> tensor<32x16xi32, #blocked1> 2026-02-21T08:49:38.8655011Z %63 = 
ttg.convert_layout %62 : tensor<32x16xi32, #blocked1> -> tensor<32x16xi32, #blocked> 2026-02-21T08:49:38.8655233Z %64 = tt.broadcast %61 : tensor<1x16xi32, #blocked> -> tensor<32x16xi32, #blocked> 2026-02-21T08:49:38.8655418Z %65 = arith.addi %63, %64 : tensor<32x16xi32, #blocked> 2026-02-21T08:49:38.8655593Z %66 = tt.splat %arg2 : !tt.ptr -> tensor<32x16x!tt.ptr, #blocked> 2026-02-21T08:49:38.8655812Z %67 = tt.addptr %66, %65 : tensor<32x16x!tt.ptr, #blocked>, tensor<32x16xi32, #blocked> 2026-02-21T08:49:38.8656008Z tt.store %67, %54 : tensor<32x16x!tt.ptr, #blocked> 2026-02-21T08:49:38.8656140Z tt.return 2026-02-21T08:49:38.8656219Z } 2026-02-21T08:49:38.8656296Z } 2026-02-21T08:49:38.8656336Z 2026-02-21T08:49:38.8656366Z {-# 2026-02-21T08:49:38.8656448Z external_resources: { 2026-02-21T08:49:38.8656544Z mlir_reproducer: { 2026-02-21T08:49:38.8658752Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=16}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, 
tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T08:49:38.8661073Z disable_threading: false, 2026-02-21T08:49:38.8661182Z verify_each: true 2026-02-21T08:49:38.8661278Z } 2026-02-21T08:49:38.8661349Z } 2026-02-21T08:49:38.8661420Z #-} 2026-02-21T08:49:38.8661698Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:49:38.8662431Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:49:38.8662989Z [217s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:49:38.8663717Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:49:38.8671087Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:49:38.8671278Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:49:39.2539140Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:47:25: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T08:49:39.2539778Z b_tile = tl.load(tl.make_block_ptr(B, [4096, 7168], [7168, 1], [offset_3, offset_2], [_BLOCK_SIZE_0, _BLOCK_SIZE_2], [1, 0]), boundary_check=[0, 1], padding_option='zero') 2026-02-21T08:49:39.2540146Z ^ 2026-02-21T08:49:39.2541034Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:60:41: note: - use: %112 = "ttg.convert_layout"(<>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:49:39.2541856Z 2026-02-21T08:49:39.2541921Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T08:49:39.2542081Z ^ 2026-02-21T08:49:39.2542912Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:59:41: note: - use: %113 = "ttg.convert_layout"(<>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:49:39.2543703Z 2026-02-21T08:49:39.2543764Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T08:49:39.2543915Z ^ 2026-02-21T08:49:39.2544084Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T08:49:39.2547717Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:49:39.2548120Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:49:39.2548500Z #blocked2 = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:49:39.2548874Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:49:39.2549220Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}> 2026-02-21T08:49:39.2549694Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [0, 1]}> 2026-02-21T08:49:39.2550035Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T08:49:39.2550385Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}> 2026-02-21T08:49:39.2550727Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:49:39.2551086Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:49:39.2551457Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:49:39.2551837Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:49:39.2552254Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:49:39.2552876Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:49:39.2553318Z %cst = arith.constant dense<0> : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2553535Z %cst_0 = arith.constant 
dense<7168> : tensor<1x16xi64, #blocked> 2026-02-21T08:49:39.2553740Z %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T08:49:39.2553950Z %cst_2 = arith.constant dense<4096> : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2554158Z %cst_3 = arith.constant dense<0> : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2554383Z %cst_4 = arith.constant dense<7168> : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2554598Z %cst_5 = arith.constant dense<0> : tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2554777Z %c28672_i32 = arith.constant 28672 : i32 2026-02-21T08:49:39.2554924Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:49:39.2555072Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:49:39.2555236Z %cst_6 = arith.constant dense<7168> : tensor<32x1xi32, #blocked1> 2026-02-21T08:49:39.2555446Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:39.2555645Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:39.2555847Z %cst_9 = arith.constant dense<4> : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2556054Z %cst_10 = arith.constant dense<8192> : tensor<32x1xi32, #blocked1> 2026-02-21T08:49:39.2556225Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:49:39.2556410Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #blocked> 2026-02-21T08:49:39.2556630Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:49:39.2556744Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:49:39.2556858Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:49:39.2556976Z %0 = tt.get_program_id x : i32 2026-02-21T08:49:39.2557094Z %1 = arith.divsi %0, %c28672_i32 : i32 2026-02-21T08:49:39.2557210Z %2 = arith.muli %1, %c64_i32 : i32 2026-02-21T08:49:39.2557323Z %3 = arith.subi %c2_i32, %2 : i32 2026-02-21T08:49:39.2557433Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T08:49:39.2557549Z %5 = arith.remsi %0, %c28672_i32 : i32 2026-02-21T08:49:39.2557664Z %6 = arith.remsi %5, %4 : i32 
2026-02-21T08:49:39.2557774Z %7 = arith.addi %2, %6 : i32 2026-02-21T08:49:39.2557883Z %8 = arith.divsi %5, %4 : i32 2026-02-21T08:49:39.2557990Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T08:49:39.2558147Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked4> 2026-02-21T08:49:39.2558334Z %11 = tt.splat %9 : i32 -> tensor<32xi32, #blocked4> 2026-02-21T08:49:39.2558485Z %12 = arith.addi %11, %10 : tensor<32xi32, #blocked4> 2026-02-21T08:49:39.2558663Z %13 = arith.muli %8, %c16_i32 : i32 2026-02-21T08:49:39.2558827Z %14 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4> 2026-02-21T08:49:39.2559010Z %15 = tt.splat %13 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:49:39.2559154Z %16 = arith.addi %15, %14 : tensor<16xi32, #blocked4> 2026-02-21T08:49:39.2559326Z %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4> 2026-02-21T08:49:39.2559602Z %18 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:39.2559936Z %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5> 2026-02-21T08:49:39.2560254Z %20 = ttg.convert_layout %19 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:49:39.2560459Z %21 = arith.muli %20, %cst_10 : tensor<32x1xi32, #blocked1> 2026-02-21T08:49:39.2571685Z %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x128xi32, #blocked1> 2026-02-21T08:49:39.2572033Z %23 = ttg.convert_layout %22 : tensor<32x128xi32, #blocked1> -> tensor<32x128xi32, #blocked6> 2026-02-21T08:49:39.2572283Z %24 = tt.splat %arg0 : !tt.ptr -> tensor<32x128x!tt.ptr, #blocked6> 2026-02-21T08:49:39.2572452Z %25 = arith.extsi %13 : i32 to i64 2026-02-21T08:49:39.2572609Z %26 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr, #blocked> 2026-02-21T08:49:39.2572814Z %27 = tt.make_range {end = 64 : i32, start = 0 
: i32} : tensor<64xi32, #blocked4> 2026-02-21T08:49:39.2573023Z %28 = arith.extsi %27 : tensor<64xi32, #blocked4> to tensor<64xi64, #blocked4> 2026-02-21T08:49:39.2573207Z %29 = tt.splat %25 : i64 -> tensor<16xi64, #blocked4> 2026-02-21T08:49:39.2573388Z %30 = arith.extsi %14 : tensor<16xi32, #blocked4> to tensor<16xi64, #blocked4> 2026-02-21T08:49:39.2573569Z %31 = arith.addi %29, %30 : tensor<16xi64, #blocked4> 2026-02-21T08:49:39.2573797Z %32 = ttg.convert_layout %31 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:39.2574128Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi64, #blocked7> 2026-02-21T08:49:39.2574415Z %34 = ttg.convert_layout %33 : tensor<1x16xi64, #blocked7> -> tensor<1x16xi64, #blocked> 2026-02-21T08:49:39.2574636Z %35 = tt.broadcast %34 : tensor<1x16xi64, #blocked> -> tensor<64x16xi64, #blocked> 2026-02-21T08:49:39.2574834Z %36 = arith.cmpi sge, %34, %cst_1 : tensor<1x16xi64, #blocked> 2026-02-21T08:49:39.2575000Z %37 = arith.cmpi slt, %34, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T08:49:39.2575160Z %38 = arith.andi %36, %37 : tensor<1x16xi1, #blocked> 2026-02-21T08:49:39.2575336Z %39 = tt.broadcast %38 : tensor<1x16xi1, #blocked> -> tensor<64x16xi1, #blocked> 2026-02-21T08:49:39.2575542Z %40 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4> 2026-02-21T08:49:39.2575798Z %41 = ttg.convert_layout %40 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:39.2576117Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x2xi32, #blocked7> 2026-02-21T08:49:39.2576398Z %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #blocked8> 2026-02-21T08:49:39.2576674Z %44 = ttg.convert_layout %43 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = 
#blocked9}>> 2026-02-21T08:49:39.2576996Z %45 = tt.expand_dims %44 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x2x1xi32, #blocked9> 2026-02-21T08:49:39.2577289Z %46 = ttg.convert_layout %45 : tensor<1x2x1xi32, #blocked9> -> tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:39.2577498Z %47 = arith.cmpi eq, %46, %cst_8 : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:39.2577761Z %48 = tt.broadcast %47 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3> 2026-02-21T08:49:39.2578003Z %49 = ttg.convert_layout %48 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2> 2026-02-21T08:49:39.2578209Z %50 = arith.cmpi eq, %46, %cst_7 : tensor<1x2x1xi32, #blocked3> 2026-02-21T08:49:39.2578404Z %51 = tt.broadcast %50 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3> 2026-02-21T08:49:39.2578633Z %52 = ttg.convert_layout %51 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2> 2026-02-21T08:49:39.2578823Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:49:39.2579048Z %53 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_11) -> (tensor<32x16xf32, #blocked>) : i32 { 2026-02-21T08:49:39.2579272Z %68 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:49:39.2579415Z %69 = tt.splat %68 : i32 -> tensor<128xi32, #blocked4> 2026-02-21T08:49:39.2579571Z %70 = arith.addi %69, %17 : tensor<128xi32, #blocked4> 2026-02-21T08:49:39.2579808Z %71 = ttg.convert_layout %70 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:39.2580164Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T08:49:39.2580455Z %73 = ttg.convert_layout %72 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6> 2026-02-21T08:49:39.2580688Z %74 = tt.broadcast %73 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6> 2026-02-21T08:49:39.2580880Z %75 
= arith.addi %23, %74 : tensor<32x128xi32, #blocked6> 2026-02-21T08:49:39.2581084Z %76 = tt.addptr %24, %75 : tensor<32x128x!tt.ptr, #blocked6>, tensor<32x128xi32, #blocked6> 2026-02-21T08:49:39.2581288Z %77 = tt.load %76 : tensor<32x128x!tt.ptr, #blocked6> 2026-02-21T08:49:39.2581485Z %78 = arith.extf %77 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6> 2026-02-21T08:49:39.2581663Z %79 = arith.extsi %arg3 : i32 to i64 2026-02-21T08:49:39.2581801Z %80 = tt.splat %79 : i64 -> tensor<64xi64, #blocked4> 2026-02-21T08:49:39.2581952Z %81 = arith.addi %80, %28 : tensor<64xi64, #blocked4> 2026-02-21T08:49:39.2582180Z %82 = ttg.convert_layout %81 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:39.2582505Z %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi64, #blocked5> 2026-02-21T08:49:39.2582793Z %84 = ttg.convert_layout %83 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2582994Z %85 = arith.muli %84, %cst_4 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2583186Z %86 = tt.broadcast %85 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1> 2026-02-21T08:49:39.2583419Z %87 = ttg.convert_layout %86 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked> 2026-02-21T08:49:39.2583618Z %88 = arith.addi %87, %35 : tensor<64x16xi64, #blocked> 2026-02-21T08:49:39.2583812Z %89 = tt.addptr %26, %88 : tensor<64x16x!tt.ptr, #blocked>, tensor<64x16xi64, #blocked> 2026-02-21T08:49:39.2584019Z %90 = arith.cmpi sge, %84, %cst_3 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2584193Z %91 = arith.cmpi slt, %84, %cst_2 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2584352Z %92 = arith.andi %90, %91 : tensor<64x1xi1, #blocked1> 2026-02-21T08:49:39.2584534Z %93 = tt.broadcast %92 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1> 2026-02-21T08:49:39.2584759Z %94 = ttg.convert_layout %93 : 
tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked> 2026-02-21T08:49:39.2584954Z %95 = arith.andi %94, %39 : tensor<64x16xi1, #blocked> 2026-02-21T08:49:39.2585117Z %96 = tt.load %89, %95, %cst : tensor<64x16x!tt.ptr, #blocked> 2026-02-21T08:49:39.2585317Z %97 = arith.shli %96, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2585477Z %98 = arith.shrsi %97, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2585634Z %99 = arith.shrsi %96, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2585879Z %100 = ttg.convert_layout %98 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:39.2586224Z %101 = tt.expand_dims %100 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:39.2586541Z %102 = ttg.convert_layout %101 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:39.2586838Z %103 = ttg.convert_layout %99 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:39.2587181Z %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:39.2587491Z %105 = ttg.convert_layout %104 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:39.2587780Z %106 = tt.broadcast %102 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:39.2588024Z %107 = ttg.convert_layout %106 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2588284Z %108 = arith.select %49, %107, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2588530Z %109 = tt.broadcast %105 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:39.2588779Z %110 = ttg.convert_layout %109 : tensor<64x2x16xi8, #blocked11> -> 
tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2589032Z %111 = arith.select %52, %110, %108 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2589269Z %112 = tt.reshape %111 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked> 2026-02-21T08:49:39.2589500Z %113 = arith.sitofp %112 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked> 2026-02-21T08:49:39.2589790Z %114 = ttg.convert_layout %78 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:49:39.2590137Z %115 = ttg.convert_layout %113 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T08:49:39.2590432Z %116 = ttg.convert_layout %arg4 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked> 2026-02-21T08:49:39.2590834Z %117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<32x16xf32, #blocked> 2026-02-21T08:49:39.2591169Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:49:39.2591296Z %118 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T08:49:39.2591420Z %119 = arith.addi %arg3, %118 : i32 2026-02-21T08:49:39.2591540Z %120 = arith.muli %119, %c2_i32 : i32 2026-02-21T08:49:39.2591680Z %121 = tt.splat %120 : i32 -> tensor<128xi32, #blocked4> 2026-02-21T08:49:39.2591836Z %122 = arith.addi %121, %17 : tensor<128xi32, #blocked4> 2026-02-21T08:49:39.2592076Z %123 = ttg.convert_layout %122 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:39.2592409Z %124 = tt.expand_dims %123 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T08:49:39.2592703Z %125 = ttg.convert_layout %124 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6> 2026-02-21T08:49:39.2592940Z %126 = tt.broadcast 
%125 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6> 2026-02-21T08:49:39.2593138Z %127 = arith.addi %23, %126 : tensor<32x128xi32, #blocked6> 2026-02-21T08:49:39.2593379Z %128 = tt.addptr %24, %127 : tensor<32x128x!tt.ptr, #blocked6>, tensor<32x128xi32, #blocked6> 2026-02-21T08:49:39.2593588Z %129 = tt.load %128 : tensor<32x128x!tt.ptr, #blocked6> 2026-02-21T08:49:39.2593785Z %130 = arith.extf %129 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6> 2026-02-21T08:49:39.2593961Z %131 = arith.extsi %119 : i32 to i64 2026-02-21T08:49:39.2594098Z %132 = tt.splat %131 : i64 -> tensor<64xi64, #blocked4> 2026-02-21T08:49:39.2594250Z %133 = arith.addi %132, %28 : tensor<64xi64, #blocked4> 2026-02-21T08:49:39.2594481Z %134 = ttg.convert_layout %133 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:39.2594809Z %135 = tt.expand_dims %134 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi64, #blocked5> 2026-02-21T08:49:39.2595096Z %136 = ttg.convert_layout %135 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2595301Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2595522Z %138 = tt.broadcast %137 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1> 2026-02-21T08:49:39.2595756Z %139 = ttg.convert_layout %138 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked> 2026-02-21T08:49:39.2595956Z %140 = arith.addi %139, %35 : tensor<64x16xi64, #blocked> 2026-02-21T08:49:39.2596151Z %141 = tt.addptr %26, %140 : tensor<64x16x!tt.ptr, #blocked>, tensor<64x16xi64, #blocked> 2026-02-21T08:49:39.2596359Z %142 = arith.cmpi sge, %136, %cst_3 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2596531Z %143 = arith.cmpi slt, %136, %cst_2 : tensor<64x1xi64, #blocked1> 2026-02-21T08:49:39.2596696Z %144 = arith.andi %142, %143 : tensor<64x1xi1, #blocked1> 2026-02-21T08:49:39.2596884Z %145 
= tt.broadcast %144 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1> 2026-02-21T08:49:39.2597115Z %146 = ttg.convert_layout %145 : tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked> 2026-02-21T08:49:39.2597312Z %147 = arith.andi %146, %39 : tensor<64x16xi1, #blocked> 2026-02-21T08:49:39.2597474Z %148 = tt.load %141, %147, %cst : tensor<64x16x!tt.ptr, #blocked> 2026-02-21T08:49:39.2597646Z %149 = arith.shli %148, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2597807Z %150 = arith.shrsi %149, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2597966Z %151 = arith.shrsi %148, %cst_9 : tensor<64x16xi8, #blocked> 2026-02-21T08:49:39.2598212Z %152 = ttg.convert_layout %150 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:39.2598552Z %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:39.2598861Z %154 = ttg.convert_layout %153 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:39.2599154Z %155 = ttg.convert_layout %151 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:49:39.2599492Z %156 = tt.expand_dims %155 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:49:39.2599796Z %157 = ttg.convert_layout %156 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:49:39.2600045Z %158 = tt.broadcast %154 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:39.2600294Z %159 = ttg.convert_layout %158 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2600548Z %160 = arith.select %49, %159, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2600792Z %161 = tt.broadcast %157 : 
tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:49:39.2601066Z %162 = ttg.convert_layout %161 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2601315Z %163 = arith.select %52, %162, %160 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2> 2026-02-21T08:49:39.2601554Z %164 = tt.reshape %163 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked> 2026-02-21T08:49:39.2601781Z %165 = arith.sitofp %164 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked> 2026-02-21T08:49:39.2602065Z %166 = ttg.convert_layout %130 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T08:49:39.2602406Z %167 = ttg.convert_layout %165 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T08:49:39.2602733Z %168 = ttg.convert_layout %117 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked> 2026-02-21T08:49:39.2603131Z %169 = tt.dot %166, %167, %168, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<32x16xf32, #blocked> 2026-02-21T08:49:39.2603515Z scf.yield %169 : tensor<32x16xf32, #blocked> 2026-02-21T08:49:39.2603637Z } {tt.flatten} 2026-02-21T08:49:39.2603783Z %54 = arith.truncf %53 : tensor<32x16xf32, #blocked> to tensor<32x16xbf16, #blocked> 2026-02-21T08:49:39.2604056Z %55 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:49:39.2604381Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5> 2026-02-21T08:49:39.2604664Z %57 = ttg.convert_layout %56 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:49:39.2604861Z %58 = arith.muli %57, %cst_6 : tensor<32x1xi32, #blocked1> 
2026-02-21T08:49:39.2605096Z %59 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:49:39.2605410Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi32, #blocked7> 2026-02-21T08:49:39.2605692Z %61 = ttg.convert_layout %60 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked> 2026-02-21T08:49:39.2605916Z %62 = tt.broadcast %58 : tensor<32x1xi32, #blocked1> -> tensor<32x16xi32, #blocked1> 2026-02-21T08:49:39.2606140Z %63 = ttg.convert_layout %62 : tensor<32x16xi32, #blocked1> -> tensor<32x16xi32, #blocked> 2026-02-21T08:49:39.2606363Z %64 = tt.broadcast %61 : tensor<1x16xi32, #blocked> -> tensor<32x16xi32, #blocked> 2026-02-21T08:49:39.2606546Z %65 = arith.addi %63, %64 : tensor<32x16xi32, #blocked> 2026-02-21T08:49:39.2606721Z %66 = tt.splat %arg2 : !tt.ptr -> tensor<32x16x!tt.ptr, #blocked> 2026-02-21T08:49:39.2606941Z %67 = tt.addptr %66, %65 : tensor<32x16x!tt.ptr, #blocked>, tensor<32x16xi32, #blocked> 2026-02-21T08:49:39.2607137Z tt.store %67, %54 : tensor<32x16x!tt.ptr, #blocked> 2026-02-21T08:49:39.2607270Z tt.return 2026-02-21T08:49:39.2607349Z } 2026-02-21T08:49:39.2607423Z } 2026-02-21T08:49:39.2607465Z 2026-02-21T08:49:39.2607502Z {-# 2026-02-21T08:49:39.2607581Z external_resources: { 2026-02-21T08:49:39.2607681Z mlir_reproducer: { 2026-02-21T08:49:39.2609979Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=16}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T08:49:39.2612253Z disable_threading: false, 2026-02-21T08:49:39.2612358Z verify_each: true 2026-02-21T08:49:39.2612449Z } 2026-02-21T08:49:39.2612522Z } 2026-02-21T08:49:39.2612592Z #-} 2026-02-21T08:49:39.2612871Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:49:39.2613639Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:49:39.2614201Z [217s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:49:39.2614921Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:49:39.2615571Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:49:39.2615743Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:49:39.4436230Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 17.3 configs/s 2026-02-21T08:49:47.9059375Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 114.2 2026-02-21T08:49:47.9059824Z configs/s 2026-02-21T08:49:48.6047478Z [226s] Generation 5 complete: 2026-02-21T08:49:48.6047630Z error=5 2026-02-21T08:49:48.6047710Z ok=81 2026-02-21T08:49:48.6047786Z min=0.1098 2026-02-21T08:49:48.6047868Z mid=0.1358 2026-02-21T08:49:48.6047942Z max=6.3263 2026-02-21T08:49:48.6048028Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:49:48.6048168Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T08:49:48.6048305Z 'l2_groupings': [64], 2026-02-21T08:49:48.6048777Z 'load_eviction_policies': ['', ''], 2026-02-21T08:49:48.6048892Z 'loop_orders': [[0, 1]], 2026-02-21T08:49:48.6048999Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:49:48.6049124Z 'num_stages': 4, 2026-02-21T08:49:48.6049209Z 'num_warps': 2, 2026-02-21T08:49:48.6049296Z 'pid_type': 'flat', 2026-02-21T08:49:48.6049397Z 'range_flattens': [None, True], 2026-02-21T08:49:48.6049509Z 'range_multi_buffers': [None, True], 2026-02-21T08:49:48.6049623Z 'range_num_stages': [0, 1], 2026-02-21T08:49:48.6049728Z 'range_unroll_factors': [0, 2], 2026-02-21T08:49:48.6049838Z 'range_warp_specializes': [], 2026-02-21T08:49:48.6049941Z 
'waves_per_eu': 1} 2026-02-21T08:49:48.7048920Z [226s] Fitting surrogate: 592 points, 592 targets 2026-02-21T08:49:49.4145089Z [227s] Generation 6 starting: 69 neighbors, 4 active search path(s) 2026-02-21T08:50:10.0795198Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.5 configs/s 2026-02-21T08:50:14.1922001Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:48:83: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T08:50:14.1924648Z b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None) 2026-02-21T08:50:14.1925306Z ^ 2026-02-21T08:50:14.1927240Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:61:41: note: - use: %89 = "ttg.convert_layout"(<>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:50:14.1928975Z 2026-02-21T08:50:14.1929114Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T08:50:14.1929453Z ^ 2026-02-21T08:50:14.1931208Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:60:41: note: - use: %90 = "ttg.convert_layout"(<>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>> 2026-02-21T08:50:14.1933037Z 2026-02-21T08:50:14.1933086Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T08:50:14.1933223Z ^ 2026-02-21T08:50:14.1933364Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T08:50:14.1933655Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 
1], order = [2, 1, 0]}> 2026-02-21T08:50:14.1934017Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:50:14.1934371Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:50:14.1934711Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:50:14.1935028Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T08:50:14.1935347Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T08:50:14.1935673Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T08:50:14.1935999Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T08:50:14.1936319Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:50:14.1936834Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:50:14.1937192Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}> 2026-02-21T08:50:14.1937546Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:50:14.1937889Z #blocked12 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [8, 8], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T08:50:14.1938265Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:50:14.1938806Z tt.func public 
@_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:50:14.1939236Z %cst = arith.constant dense<0> : tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1939408Z %c28672_i32 = arith.constant 28672 : i32 2026-02-21T08:50:14.1939548Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:50:14.1939686Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:50:14.1939851Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:50:14.1940050Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:50:14.1940243Z %cst_2 = arith.constant dense<4> : tensor<64x16xi8, #blocked2> 2026-02-21T08:50:14.1940443Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1940634Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1940802Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:50:14.1940974Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x16xf32, #blocked2> 2026-02-21T08:50:14.1941154Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:50:14.1941286Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:50:14.1941417Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:50:14.1941543Z %0 = tt.get_program_id x : i32 2026-02-21T08:50:14.1941706Z %1 = arith.divsi %0, %c28672_i32 : i32 2026-02-21T08:50:14.1941839Z %2 = arith.muli %1, %c64_i32 : i32 2026-02-21T08:50:14.1941965Z %3 = arith.subi %c1_i32, %2 : i32 2026-02-21T08:50:14.1942090Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T08:50:14.1942217Z %5 = arith.remsi %0, %c28672_i32 : i32 2026-02-21T08:50:14.1942345Z %6 = arith.remsi %5, %4 : i32 2026-02-21T08:50:14.1942465Z %7 = arith.addi %2, %6 : i32 2026-02-21T08:50:14.1942579Z %8 = arith.divsi %5, %4 : i32 2026-02-21T08:50:14.1942699Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T08:50:14.1942853Z %10 = tt.make_range {end = 64 : 
i32, start = 0 : i32} : tensor<64xi32, #blocked4> 2026-02-21T08:50:14.1943043Z %11 = tt.splat %9 : i32 -> tensor<64xi32, #blocked4> 2026-02-21T08:50:14.1943194Z %12 = arith.addi %11, %10 : tensor<64xi32, #blocked4> 2026-02-21T08:50:14.1943325Z %13 = arith.muli %8, %c16_i32 : i32 2026-02-21T08:50:14.1943480Z %14 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4> 2026-02-21T08:50:14.1943659Z %15 = tt.splat %13 : i32 -> tensor<16xi32, #blocked4> 2026-02-21T08:50:14.1943803Z %16 = arith.addi %15, %14 : tensor<16xi32, #blocked4> 2026-02-21T08:50:14.1943973Z %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4> 2026-02-21T08:50:14.1944239Z %18 = ttg.convert_layout %12 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:50:14.1944564Z %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5> 2026-02-21T08:50:14.1944847Z %20 = ttg.convert_layout %19 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1945049Z %21 = arith.muli %20, %cst_4 : tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1945290Z %22 = tt.broadcast %21 : tensor<64x1xi32, #blocked3> -> tensor<64x128xi32, #blocked3> 2026-02-21T08:50:14.1945530Z %23 = ttg.convert_layout %22 : tensor<64x128xi32, #blocked3> -> tensor<64x128xi32, #blocked6> 2026-02-21T08:50:14.1945758Z %24 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr, #blocked6> 2026-02-21T08:50:14.1946012Z %25 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:50:14.1946328Z %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi32, #blocked7> 2026-02-21T08:50:14.1946607Z %27 = ttg.convert_layout %26 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked2> 2026-02-21T08:50:14.1946834Z %28 = tt.broadcast 
%27 : tensor<1x16xi32, #blocked2> -> tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1947042Z %29 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr, #blocked2> 2026-02-21T08:50:14.1947240Z %30 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4> 2026-02-21T08:50:14.1947493Z %31 = ttg.convert_layout %30 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:50:14.1947804Z %32 = tt.expand_dims %31 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x2xi32, #blocked7> 2026-02-21T08:50:14.1948080Z %33 = ttg.convert_layout %32 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #blocked8> 2026-02-21T08:50:14.1948356Z %34 = ttg.convert_layout %33 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T08:50:14.1948678Z %35 = tt.expand_dims %34 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x2x1xi32, #blocked9> 2026-02-21T08:50:14.1948969Z %36 = ttg.convert_layout %35 : tensor<1x2x1xi32, #blocked9> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T08:50:14.1949176Z %37 = arith.cmpi eq, %36, %cst_1 : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:50:14.1949373Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked1> -> tensor<64x2x16xi1, #blocked1> 2026-02-21T08:50:14.1949640Z %39 = ttg.convert_layout %38 : tensor<64x2x16xi1, #blocked1> -> tensor<64x2x16xi1, #blocked> 2026-02-21T08:50:14.1949843Z %40 = arith.cmpi eq, %36, %cst_0 : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:50:14.1950037Z %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked1> -> tensor<64x2x16xi1, #blocked1> 2026-02-21T08:50:14.1950265Z %42 = ttg.convert_layout %41 : tensor<64x2x16xi1, #blocked1> -> tensor<64x2x16xi1, #blocked> 2026-02-21T08:50:14.1950455Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:50:14.1950678Z %43 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x16xf32, 
#blocked2>) : i32 { 2026-02-21T08:50:14.1950919Z %58 = tt.splat %arg3 : i32 -> tensor<64xi32, #blocked4> 2026-02-21T08:50:14.1951076Z %59 = arith.addi %58, %10 : tensor<64xi32, #blocked4> 2026-02-21T08:50:14.1951214Z %60 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:50:14.1951353Z %61 = tt.splat %60 : i32 -> tensor<128xi32, #blocked4> 2026-02-21T08:50:14.1951499Z %62 = arith.addi %61, %17 : tensor<128xi32, #blocked4> 2026-02-21T08:50:14.1951732Z %63 = ttg.convert_layout %62 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:50:14.1952057Z %64 = tt.expand_dims %63 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T08:50:14.1952341Z %65 = ttg.convert_layout %64 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6> 2026-02-21T08:50:14.1952573Z %66 = tt.broadcast %65 : tensor<1x128xi32, #blocked6> -> tensor<64x128xi32, #blocked6> 2026-02-21T08:50:14.1952762Z %67 = arith.addi %23, %66 : tensor<64x128xi32, #blocked6> 2026-02-21T08:50:14.1952999Z %68 = tt.addptr %24, %67 : tensor<64x128x!tt.ptr, #blocked6>, tensor<64x128xi32, #blocked6> 2026-02-21T08:50:14.1953205Z %69 = tt.load %68 : tensor<64x128x!tt.ptr, #blocked6> 2026-02-21T08:50:14.1953396Z %70 = arith.extf %69 : tensor<64x128xbf16, #blocked6> to tensor<64x128xf32, #blocked6> 2026-02-21T08:50:14.1953665Z %71 = ttg.convert_layout %59 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:50:14.1953982Z %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5> 2026-02-21T08:50:14.1954263Z %73 = ttg.convert_layout %72 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1954462Z %74 = arith.muli %73, %cst_3 : tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1954646Z %75 = tt.broadcast %74 : tensor<64x1xi32, #blocked3> -> tensor<64x16xi32, 
#blocked3> 2026-02-21T08:50:14.1954881Z %76 = ttg.convert_layout %75 : tensor<64x16xi32, #blocked3> -> tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1955079Z %77 = arith.addi %76, %28 : tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1955275Z %78 = tt.addptr %29, %77 : tensor<64x16x!tt.ptr, #blocked2>, tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1955473Z %79 = tt.load %78 : tensor<64x16x!tt.ptr, #blocked2> 2026-02-21T08:50:14.1955627Z %80 = arith.shli %79, %cst_2 : tensor<64x16xi8, #blocked2> 2026-02-21T08:50:14.1955787Z %81 = arith.shrsi %80, %cst_2 : tensor<64x16xi8, #blocked2> 2026-02-21T08:50:14.1955944Z %82 = arith.shrsi %79, %cst_2 : tensor<64x16xi8, #blocked2> 2026-02-21T08:50:14.1956186Z %83 = ttg.convert_layout %81 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:50:14.1956524Z %84 = tt.expand_dims %83 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:50:14.1956825Z %85 = ttg.convert_layout %84 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:50:14.1957114Z %86 = ttg.convert_layout %82 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:50:14.1966337Z %87 = tt.expand_dims %86 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:50:14.1966638Z %88 = ttg.convert_layout %87 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:50:14.1966883Z %89 = tt.broadcast %85 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:50:14.1967121Z %90 = ttg.convert_layout %89 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1967362Z %91 = arith.select %39, %90, %cst : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1967600Z %92 = tt.broadcast %88 : 
tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:50:14.1967835Z %93 = ttg.convert_layout %92 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1968075Z %94 = arith.select %42, %93, %91 : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1968298Z %95 = tt.reshape %94 : tensor<64x2x16xi8, #blocked> -> tensor<128x16xi8, #blocked2> 2026-02-21T08:50:14.1968523Z %96 = arith.sitofp %95 : tensor<128x16xi8, #blocked2> to tensor<128x16xf32, #blocked2> 2026-02-21T08:50:14.1968809Z %97 = ttg.convert_layout %70 : tensor<64x128xf32, #blocked6> -> tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> 2026-02-21T08:50:14.1969156Z %98 = ttg.convert_layout %96 : tensor<128x16xf32, #blocked2> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> 2026-02-21T08:50:14.1969458Z %99 = ttg.convert_layout %arg4 : tensor<64x16xf32, #blocked2> -> tensor<64x16xf32, #blocked12> 2026-02-21T08:50:14.1969935Z %100 = tt.dot %97, %98, %99, inputPrecision = tf32 : tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<64x16xf32, #blocked12> 2026-02-21T08:50:14.1970344Z %101 = ttg.convert_layout %100 : tensor<64x16xf32, #blocked12> -> tensor<64x16xf32, #blocked2> 2026-02-21T08:50:14.1970557Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T08:50:14.1970683Z %102 = arith.muli %c64_i32, %c1_i32_6 : i32 2026-02-21T08:50:14.1970807Z %103 = arith.addi %arg3, %102 : i32 2026-02-21T08:50:14.1970946Z %104 = tt.splat %103 : i32 -> tensor<64xi32, #blocked4> 2026-02-21T08:50:14.1971097Z %105 = arith.addi %104, %10 : tensor<64xi32, #blocked4> 2026-02-21T08:50:14.1971232Z %106 = arith.muli %103, %c2_i32 : i32 2026-02-21T08:50:14.1971365Z %107 = tt.splat %106 : i32 -> tensor<128xi32, #blocked4> 2026-02-21T08:50:14.1971520Z %108 = arith.addi %107, %17 : tensor<128xi32, #blocked4> 2026-02-21T08:50:14.1971763Z %109 
= ttg.convert_layout %108 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:50:14.1972094Z %110 = tt.expand_dims %109 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T08:50:14.1972389Z %111 = ttg.convert_layout %110 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6> 2026-02-21T08:50:14.1972626Z %112 = tt.broadcast %111 : tensor<1x128xi32, #blocked6> -> tensor<64x128xi32, #blocked6> 2026-02-21T08:50:14.1972824Z %113 = arith.addi %23, %112 : tensor<64x128xi32, #blocked6> 2026-02-21T08:50:14.1973030Z %114 = tt.addptr %24, %113 : tensor<64x128x!tt.ptr, #blocked6>, tensor<64x128xi32, #blocked6> 2026-02-21T08:50:14.1973241Z %115 = tt.load %114 : tensor<64x128x!tt.ptr, #blocked6> 2026-02-21T08:50:14.1973437Z %116 = arith.extf %115 : tensor<64x128xbf16, #blocked6> to tensor<64x128xf32, #blocked6> 2026-02-21T08:50:14.1973711Z %117 = ttg.convert_layout %105 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:50:14.1974077Z %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5> 2026-02-21T08:50:14.1974366Z %119 = ttg.convert_layout %118 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1974570Z %120 = arith.muli %119, %cst_3 : tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1974762Z %121 = tt.broadcast %120 : tensor<64x1xi32, #blocked3> -> tensor<64x16xi32, #blocked3> 2026-02-21T08:50:14.1974993Z %122 = ttg.convert_layout %121 : tensor<64x16xi32, #blocked3> -> tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1975197Z %123 = arith.addi %122, %28 : tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1975393Z %124 = tt.addptr %29, %123 : tensor<64x16x!tt.ptr, #blocked2>, tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1975595Z %125 = tt.load %124 : tensor<64x16x!tt.ptr, #blocked2> 
2026-02-21T08:50:14.1975754Z %126 = arith.shli %125, %cst_2 : tensor<64x16xi8, #blocked2> 2026-02-21T08:50:14.1975919Z %127 = arith.shrsi %126, %cst_2 : tensor<64x16xi8, #blocked2> 2026-02-21T08:50:14.1976080Z %128 = arith.shrsi %125, %cst_2 : tensor<64x16xi8, #blocked2> 2026-02-21T08:50:14.1976324Z %129 = ttg.convert_layout %127 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:50:14.1976668Z %130 = tt.expand_dims %129 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:50:14.1976976Z %131 = ttg.convert_layout %130 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:50:14.1977268Z %132 = ttg.convert_layout %128 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T08:50:14.1977643Z %133 = tt.expand_dims %132 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10> 2026-02-21T08:50:14.1977952Z %134 = ttg.convert_layout %133 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11> 2026-02-21T08:50:14.1978196Z %135 = tt.broadcast %131 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:50:14.1978445Z %136 = ttg.convert_layout %135 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1978689Z %137 = arith.select %39, %136, %cst : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1978930Z %138 = tt.broadcast %134 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11> 2026-02-21T08:50:14.1979172Z %139 = ttg.convert_layout %138 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1979416Z %140 = arith.select %42, %139, %137 : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked> 2026-02-21T08:50:14.1979653Z %141 = tt.reshape %140 : tensor<64x2x16xi8, 
#blocked> -> tensor<128x16xi8, #blocked2> 2026-02-21T08:50:14.1979884Z %142 = arith.sitofp %141 : tensor<128x16xi8, #blocked2> to tensor<128x16xf32, #blocked2> 2026-02-21T08:50:14.1980178Z %143 = ttg.convert_layout %116 : tensor<64x128xf32, #blocked6> -> tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> 2026-02-21T08:50:14.1980527Z %144 = ttg.convert_layout %142 : tensor<128x16xf32, #blocked2> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> 2026-02-21T08:50:14.1980823Z %145 = ttg.convert_layout %101 : tensor<64x16xf32, #blocked2> -> tensor<64x16xf32, #blocked12> 2026-02-21T08:50:14.1981234Z %146 = tt.dot %143, %144, %145, inputPrecision = tf32 : tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<64x16xf32, #blocked12> 2026-02-21T08:50:14.1981636Z %147 = ttg.convert_layout %146 : tensor<64x16xf32, #blocked12> -> tensor<64x16xf32, #blocked2> 2026-02-21T08:50:14.1981866Z scf.yield %147 : tensor<64x16xf32, #blocked2> 2026-02-21T08:50:14.1981993Z } {tt.flatten} 2026-02-21T08:50:14.1982137Z %44 = arith.truncf %43 : tensor<64x16xf32, #blocked2> to tensor<64x16xbf16, #blocked2> 2026-02-21T08:50:14.1982408Z %45 = ttg.convert_layout %12 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T08:50:14.1982726Z %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5> 2026-02-21T08:50:14.1983011Z %47 = ttg.convert_layout %46 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1983209Z %48 = arith.muli %47, %cst_3 : tensor<64x1xi32, #blocked3> 2026-02-21T08:50:14.1983440Z %49 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T08:50:14.1983754Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = 
#blocked7}>> -> tensor<1x16xi32, #blocked7> 2026-02-21T08:50:14.1984034Z %51 = ttg.convert_layout %50 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked2> 2026-02-21T08:50:14.1984262Z %52 = tt.broadcast %48 : tensor<64x1xi32, #blocked3> -> tensor<64x16xi32, #blocked3> 2026-02-21T08:50:14.1984494Z %53 = ttg.convert_layout %52 : tensor<64x16xi32, #blocked3> -> tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1984719Z %54 = tt.broadcast %51 : tensor<1x16xi32, #blocked2> -> tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1984908Z %55 = arith.addi %53, %54 : tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1985083Z %56 = tt.splat %arg2 : !tt.ptr -> tensor<64x16x!tt.ptr, #blocked2> 2026-02-21T08:50:14.1985304Z %57 = tt.addptr %56, %55 : tensor<64x16x!tt.ptr, #blocked2>, tensor<64x16xi32, #blocked2> 2026-02-21T08:50:14.1985537Z tt.store %57, %44 : tensor<64x16x!tt.ptr, #blocked2> 2026-02-21T08:50:14.1985670Z tt.return 2026-02-21T08:50:14.1985753Z } 2026-02-21T08:50:14.1985826Z } 2026-02-21T08:50:14.1985868Z 2026-02-21T08:50:14.1985901Z {-# 2026-02-21T08:50:14.1985979Z external_resources: { 2026-02-21T08:50:14.1986079Z mlir_reproducer: { 2026-02-21T08:50:14.1988302Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=16}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T08:50:14.1990604Z disable_threading: false, 2026-02-21T08:50:14.1990715Z verify_each: true 2026-02-21T08:50:14.1990803Z } 2026-02-21T08:50:14.1990877Z } 2026-02-21T08:50:14.1990944Z #-} 2026-02-21T08:50:14.1991223Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:50:14.1991985Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:50:14.1992536Z [252s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:50:14.1993255Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:50:14.1993906Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:50:14.1994072Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:50:14.4577602Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 16.3 configs/s 2026-02-21T08:50:22.8339509Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 126.4 2026-02-21T08:50:22.8340020Z configs/s 2026-02-21T08:50:23.4685467Z [261s] Generation 6 complete: 2026-02-21T08:50:23.4685861Z error=1 2026-02-21T08:50:23.4686068Z ok=72 2026-02-21T08:50:23.4686277Z min=0.1096 2026-02-21T08:50:23.4686480Z mid=0.1345 2026-02-21T08:50:23.4686679Z max=3.5852 2026-02-21T08:50:23.4686905Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:50:23.4687282Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T08:50:23.4687644Z 'l2_groupings': [64], 2026-02-21T08:50:23.4687920Z 'load_eviction_policies': ['', ''], 2026-02-21T08:50:23.4688802Z 'loop_orders': [[0, 1]], 2026-02-21T08:50:23.4689084Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:50:23.4689353Z 'num_stages': 4, 2026-02-21T08:50:23.4689607Z 'num_warps': 2, 2026-02-21T08:50:23.4689841Z 'pid_type': 'flat', 2026-02-21T08:50:23.4690099Z 'range_flattens': [None, None], 2026-02-21T08:50:23.4690402Z 'range_multi_buffers': [None, True], 2026-02-21T08:50:23.4690703Z 'range_num_stages': [0, 1], 2026-02-21T08:50:23.4690981Z 'range_unroll_factors': [0, 2], 2026-02-21T08:50:23.4691270Z 'range_warp_specializes': [], 2026-02-21T08:50:23.4691547Z 
'waves_per_eu': 1} 2026-02-21T08:50:23.5585999Z [261s] Fitting surrogate: 665 points, 665 targets 2026-02-21T08:50:24.3158495Z [262s] Generation 7 starting: 74 neighbors, 4 active search path(s) 2026-02-21T08:50:37.0906402Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 6.4 configs/s 2026-02-21T08:50:41.8022291Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.2 configs/s 2026-02-21T08:50:51.1438298Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 104.3 2026-02-21T08:50:51.1438881Z configs/s 2026-02-21T08:50:51.9213975Z [290s] Generation 7 complete: 2026-02-21T08:50:51.9214206Z ok=78 2026-02-21T08:50:51.9214322Z min=0.1065 2026-02-21T08:50:51.9214445Z mid=0.1277 2026-02-21T08:50:51.9214552Z max=9.2420 2026-02-21T08:50:51.9214673Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:50:51.9214878Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T08:50:51.9215080Z 'l2_groupings': [8], 2026-02-21T08:50:51.9215229Z 'load_eviction_policies': ['', ''], 2026-02-21T08:50:51.9215393Z 'loop_orders': [[1, 0]], 2026-02-21T08:50:51.9215540Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:50:51.9215689Z 'num_sm_multiplier': 16, 2026-02-21T08:50:51.9215832Z 'num_stages': 2, 2026-02-21T08:50:51.9215956Z 'num_warps': 2, 2026-02-21T08:50:51.9216102Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:50:51.9216292Z 'range_flattens': [None, True], 2026-02-21T08:50:51.9216454Z 'range_multi_buffers': [None, None], 2026-02-21T08:50:51.9216971Z 'range_num_stages': [0, 1], 2026-02-21T08:50:51.9217123Z 'range_unroll_factors': [1, 4], 2026-02-21T08:50:51.9217280Z 'range_warp_specializes': [], 2026-02-21T08:50:51.9217442Z 'waves_per_eu': 1} 2026-02-21T08:50:52.0558654Z [290s] Fitting surrogate: 743 points, 743 targets 2026-02-21T08:50:52.8536866Z [291s] Generation 8 starting: 74 neighbors, 4 active search path(s) 2026-02-21T08:51:06.4959831Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 5.7 configs/s 
2026-02-21T08:51:11.0025832Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.8 configs/s 2026-02-21T08:51:21.5080026Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 102.0 2026-02-21T08:51:21.5080293Z configs/s 2026-02-21T08:51:22.3841243Z [320s] Generation 8 complete: 2026-02-21T08:51:22.3841681Z error=2 2026-02-21T08:51:22.3841886Z ok=77 2026-02-21T08:51:22.3842092Z min=0.1041 2026-02-21T08:51:22.3842297Z mid=0.1346 2026-02-21T08:51:22.3842523Z max=3.5718 2026-02-21T08:51:22.3842967Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:51:22.3843349Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T08:51:22.3843710Z 'l2_groupings': [8], 2026-02-21T08:51:22.3843985Z 'load_eviction_policies': ['', ''], 2026-02-21T08:51:22.3844291Z 'loop_orders': [[1, 0]], 2026-02-21T08:51:22.3844564Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:51:22.3844841Z 'num_sm_multiplier': 16, 2026-02-21T08:51:22.3845100Z 'num_stages': 2, 2026-02-21T08:51:22.3845335Z 'num_warps': 2, 2026-02-21T08:51:22.3845569Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:51:22.3845838Z 'range_flattens': [None, None], 2026-02-21T08:51:22.3846086Z 'range_multi_buffers': [None, None], 2026-02-21T08:51:22.3846340Z 'range_num_stages': [0, 1], 2026-02-21T08:51:22.3846603Z 'range_unroll_factors': [1, 4], 2026-02-21T08:51:22.3847238Z 'range_warp_specializes': [], 2026-02-21T08:51:22.3847473Z 'waves_per_eu': 1} 2026-02-21T08:51:22.3926572Z [320s] Fitting surrogate: 822 points, 822 targets 2026-02-21T08:51:23.0313391Z [321s] Generation 9 starting: 57 neighbors, 3 active search path(s) 2026-02-21T08:51:39.1114758Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 0.3 configs/s 2026-02-21T08:51:42.7161422Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 16.5 configs/s 2026-02-21T08:51:48.7350061Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 158.1 2026-02-21T08:51:48.7350664Z configs/s 2026-02-21T08:51:49.3063776Z 
[347s] Generation 9 complete: 2026-02-21T08:51:49.3063920Z ok=60 2026-02-21T08:51:49.3063999Z min=0.1047 2026-02-21T08:51:49.3064085Z mid=0.1326 2026-02-21T08:51:49.3064163Z max=2.1349 2026-02-21T08:51:49.3064253Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:51:49.3064416Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T08:51:49.3064555Z 'l2_groupings': [8], 2026-02-21T08:51:49.3064680Z 'load_eviction_policies': ['', ''], 2026-02-21T08:51:49.3064799Z 'loop_orders': [[1, 0]], 2026-02-21T08:51:49.3064902Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:51:49.3065008Z 'num_sm_multiplier': 16, 2026-02-21T08:51:49.3065120Z 'num_stages': 2, 2026-02-21T08:51:49.3065204Z 'num_warps': 2, 2026-02-21T08:51:49.3065302Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:51:49.3065422Z 'range_flattens': [None, None], 2026-02-21T08:51:49.3065535Z 'range_multi_buffers': [None, None], 2026-02-21T08:51:49.3065647Z 'range_num_stages': [0, 1], 2026-02-21T08:51:49.3065751Z 'range_unroll_factors': [1, 4], 2026-02-21T08:51:49.3065858Z 'range_warp_specializes': [], 2026-02-21T08:51:49.3065962Z 'waves_per_eu': 1} 2026-02-21T08:51:49.3894807Z [347s] Fitting surrogate: 882 points, 882 targets 2026-02-21T08:51:49.8227066Z [347s] Generation 10 starting: 35 neighbors, 2 active search path(s) 2026-02-21T08:51:56.6342520Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 7.7 configs/s 2026-02-21T08:51:58.8184743Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 16.3 configs/s 2026-02-21T08:52:03.3694296Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 205.6 2026-02-21T08:52:03.3694929Z configs/s 2026-02-21T08:52:03.8899804Z [362s] Generation 10 complete: 2026-02-21T08:52:03.8900140Z ok=38 2026-02-21T08:52:03.8900357Z min=0.1038 2026-02-21T08:52:03.8900570Z mid=0.1269 2026-02-21T08:52:03.8900770Z max=3.5875 2026-02-21T08:52:03.8901003Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:52:03.8901355Z 'indexing': ['block_ptr', 
'block_ptr', 'pointer'], 2026-02-21T08:52:03.8901750Z 'l2_groupings': [8], 2026-02-21T08:52:03.8902025Z 'load_eviction_policies': ['', ''], 2026-02-21T08:52:03.8902337Z 'loop_orders': [[1, 0]], 2026-02-21T08:52:03.8902613Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:52:03.8902929Z 'num_sm_multiplier': 16, 2026-02-21T08:52:03.8903202Z 'num_stages': 2, 2026-02-21T08:52:03.8903460Z 'num_warps': 2, 2026-02-21T08:52:03.8903723Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:52:03.8904055Z 'range_flattens': [None, None], 2026-02-21T08:52:03.8904362Z 'range_multi_buffers': [None, None], 2026-02-21T08:52:03.8904669Z 'range_num_stages': [0, 1], 2026-02-21T08:52:03.8904949Z 'range_unroll_factors': [1, 4], 2026-02-21T08:52:03.8905248Z 'range_warp_specializes': [], 2026-02-21T08:52:03.8905530Z 'waves_per_eu': 1} 2026-02-21T08:52:03.9513295Z [362s] Fitting surrogate: 920 points, 920 targets 2026-02-21T08:52:04.3783720Z [362s] Generation 11 starting: 30 neighbors, 2 active search path(s) 2026-02-21T08:52:11.9960995Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 2.5 configs/s 2026-02-21T08:52:13.9936956Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 16.2 configs/s 2026-02-21T08:52:17.4408167Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 265.3 2026-02-21T08:52:17.4408778Z configs/s 2026-02-21T08:52:17.9297161Z [376s] Generation 11 complete: 2026-02-21T08:52:17.9297508Z ok=33 2026-02-21T08:52:17.9297690Z min=0.1046 2026-02-21T08:52:17.9297872Z mid=0.1367 2026-02-21T08:52:17.9298043Z max=8.0136 2026-02-21T08:52:17.9298242Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:52:17.9298563Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T08:52:17.9298887Z 'l2_groupings': [8], 2026-02-21T08:52:17.9299115Z 'load_eviction_policies': ['', ''], 2026-02-21T08:52:17.9299406Z 'loop_orders': [[1, 0]], 2026-02-21T08:52:17.9299626Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:52:17.9299849Z 'num_sm_multiplier': 16, 
2026-02-21T08:52:17.9300053Z 'num_stages': 2, 2026-02-21T08:52:17.9300237Z 'num_warps': 2, 2026-02-21T08:52:17.9300446Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:52:17.9300752Z 'range_flattens': [None, None], 2026-02-21T08:52:17.9300996Z 'range_multi_buffers': [None, None], 2026-02-21T08:52:17.9301243Z 'range_num_stages': [0, 1], 2026-02-21T08:52:17.9301488Z 'range_unroll_factors': [1, 4], 2026-02-21T08:52:17.9301723Z 'range_warp_specializes': [], 2026-02-21T08:52:17.9301949Z 'waves_per_eu': 1} 2026-02-21T08:52:17.9786485Z [376s] Fitting surrogate: 953 points, 953 targets 2026-02-21T08:52:18.4570359Z [376s] Generation 12 starting: 36 neighbors, 2 active search path(s) 2026-02-21T08:52:25.5193328Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 7.0 configs/s 2026-02-21T08:52:27.8226037Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 16.7 configs/s 2026-02-21T08:52:31.4668187Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 252.6 2026-02-21T08:52:31.4668719Z configs/s 2026-02-21T08:52:32.0051473Z [390s] Generation 12 complete: 2026-02-21T08:52:32.0051756Z ok=39 2026-02-21T08:52:32.0051963Z min=0.1038 2026-02-21T08:52:32.0052143Z mid=0.1439 2026-02-21T08:52:32.0052302Z max=1.6405 2026-02-21T08:52:32.0052864Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:52:32.0053172Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T08:52:32.0053477Z 'l2_groupings': [8], 2026-02-21T08:52:32.0053701Z 'load_eviction_policies': ['', ''], 2026-02-21T08:52:32.0053959Z 'loop_orders': [[1, 0]], 2026-02-21T08:52:32.0054187Z 'matrix_instr_nonkdim': 0, 2026-02-21T08:52:32.0054416Z 'num_sm_multiplier': 16, 2026-02-21T08:52:32.0054632Z 'num_stages': 2, 2026-02-21T08:52:32.0054823Z 'num_warps': 2, 2026-02-21T08:52:32.0055045Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:52:32.0055310Z 'range_flattens': [None, None], 2026-02-21T08:52:32.0055562Z 'range_multi_buffers': [None, None], 2026-02-21T08:52:32.0055811Z 
'range_num_stages': [0, 1], 2026-02-21T08:52:32.0056036Z 'range_unroll_factors': [1, 4], 2026-02-21T08:52:32.0056280Z 'range_warp_specializes': [], 2026-02-21T08:52:32.0056517Z 'waves_per_eu': 1} 2026-02-21T08:52:32.0566178Z [390s] Fitting surrogate: 992 points, 992 targets 2026-02-21T08:52:32.3493721Z [390s] Generation 13 starting: 18 neighbors, 1 active search path(s) 2026-02-21T08:52:36.5551460Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 3.8 configs/s 2026-02-21T08:52:37.7614226Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 16.9 configs/s 2026-02-21T08:52:39.5685196Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 470.4 2026-02-21T08:52:39.5685675Z configs/s 2026-02-21T08:52:40.0403076Z [398s] Generation 13 complete: 2026-02-21T08:52:40.0403357Z ok=20 2026-02-21T08:52:40.0403530Z min=0.1060 2026-02-21T08:52:40.0403694Z mid=0.1515 2026-02-21T08:52:40.0403853Z max=0.2129 2026-02-21T08:52:40.0404033Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:52:40.0404320Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:52:40.0404968Z 'l2_groupings': [32], 2026-02-21T08:52:40.0405190Z 'load_eviction_policies': ['', ''], 2026-02-21T08:52:40.0405455Z 'loop_orders': [[0, 1]], 2026-02-21T08:52:40.0405683Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:52:40.0405891Z 'num_stages': 3, 2026-02-21T08:52:40.0406072Z 'num_warps': 2, 2026-02-21T08:52:40.0406258Z 'pid_type': 'flat', 2026-02-21T08:52:40.0406461Z 'range_flattens': [None, True], 2026-02-21T08:52:40.0406702Z 'range_multi_buffers': [None, False], 2026-02-21T08:52:40.0406941Z 'range_num_stages': [0, 2], 2026-02-21T08:52:40.0407169Z 'range_unroll_factors': [0, 4], 2026-02-21T08:52:40.0407391Z 'range_warp_specializes': [], 2026-02-21T08:52:40.0407607Z 'waves_per_eu': 1} 2026-02-21T08:52:40.0651815Z [398s] Fitting surrogate: 1012 points, 1012 targets 2026-02-21T08:52:40.3712577Z [398s] Generation 14 starting: 19 neighbors, 1 active search path(s) 
2026-02-21T08:52:44.1339685Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 6.1 configs/s 2026-02-21T08:52:45.3941534Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.0 configs/s 2026-02-21T08:52:46.4892523Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 719.3 2026-02-21T08:52:46.4893006Z configs/s 2026-02-21T08:52:46.9252015Z [405s] Generation 14 complete: 2026-02-21T08:52:46.9252288Z ok=20 2026-02-21T08:52:46.9252455Z min=0.1058 2026-02-21T08:52:46.9252622Z mid=0.1694 2026-02-21T08:52:46.9252770Z max=1.0152 2026-02-21T08:52:46.9252945Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:52:46.9253230Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:52:46.9253543Z 'l2_groupings': [16], 2026-02-21T08:52:46.9253754Z 'load_eviction_policies': ['', ''], 2026-02-21T08:52:46.9253995Z 'loop_orders': [[0, 1]], 2026-02-21T08:52:46.9254209Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:52:46.9254419Z 'num_stages': 3, 2026-02-21T08:52:46.9254595Z 'num_warps': 2, 2026-02-21T08:52:46.9254795Z 'pid_type': 'flat', 2026-02-21T08:52:46.9255000Z 'range_flattens': [None, True], 2026-02-21T08:52:46.9255248Z 'range_multi_buffers': [None, False], 2026-02-21T08:52:46.9255492Z 'range_num_stages': [0, 3], 2026-02-21T08:52:46.9255701Z 'range_unroll_factors': [0, 4], 2026-02-21T08:52:46.9255925Z 'range_warp_specializes': [], 2026-02-21T08:52:46.9256136Z 'waves_per_eu': 1} 2026-02-21T08:52:46.9424666Z [405s] Fitting surrogate: 1032 points, 1032 targets 2026-02-21T08:52:47.2413616Z [405s] Generation 15 starting: 19 neighbors, 1 active search path(s) 2026-02-21T08:53:20.0872868Z [438s] Timeout after 30s compiling Config(block_sizes=[128, 32, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], 
range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:53:20.0894977Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 0.2 configs/s 2026-02-21T08:53:21.1965718Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.9 configs/s 2026-02-21T08:53:23.0765612Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 456.5 2026-02-21T08:53:23.0765983Z configs/s 2026-02-21T08:53:23.5744959Z [441s] Generation 15 complete: 2026-02-21T08:53:23.5745353Z timeout=1 2026-02-21T08:53:23.5745561Z ok=19 2026-02-21T08:53:23.5745763Z min=0.1054 2026-02-21T08:53:23.5745966Z mid=0.1422 2026-02-21T08:53:23.5746166Z max=1.0147 2026-02-21T08:53:23.5759146Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:53:23.5759297Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:53:23.5759442Z 'l2_groupings': [16], 2026-02-21T08:53:23.5759557Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:23.5759691Z 'loop_orders': [[0, 1]], 2026-02-21T08:53:23.5759822Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:53:23.5759929Z 'num_stages': 3, 2026-02-21T08:53:23.5760022Z 'num_warps': 2, 2026-02-21T08:53:23.5760127Z 'pid_type': 'flat', 2026-02-21T08:53:23.5760233Z 'range_flattens': [None, None], 2026-02-21T08:53:23.5760352Z 'range_multi_buffers': [None, False], 2026-02-21T08:53:23.5760479Z 'range_num_stages': [0, 3], 2026-02-21T08:53:23.5760589Z 'range_unroll_factors': [0, 4], 2026-02-21T08:53:23.5760710Z 'range_warp_specializes': [], 2026-02-21T08:53:23.5760822Z 'waves_per_eu': 1} 2026-02-21T08:53:23.5971785Z [441s] Fitting surrogate: 1052 points, 1052 targets 2026-02-21T08:53:23.8741510Z [442s] Generation 16 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:53:27.7634671Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 4.8 configs/s 2026-02-21T08:53:28.8969161Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s 2026-02-21T08:53:31.7632532Z Generation 16: 
verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 458.8 2026-02-21T08:53:31.7632842Z configs/s 2026-02-21T08:53:32.2522007Z [450s] Generation 16 complete: 2026-02-21T08:53:32.2522381Z ok=18 2026-02-21T08:53:32.2522700Z min=0.1046 2026-02-21T08:53:32.2522917Z mid=0.1480 2026-02-21T08:53:32.2523118Z max=0.9930 2026-02-21T08:53:32.2523345Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:53:32.2523734Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:53:32.2524092Z 'l2_groupings': [16], 2026-02-21T08:53:32.2524374Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:32.2524688Z 'loop_orders': [[0, 1]], 2026-02-21T08:53:32.2524968Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:53:32.2525238Z 'num_stages': 3, 2026-02-21T08:53:32.2525464Z 'num_warps': 2, 2026-02-21T08:53:32.2525701Z 'pid_type': 'flat', 2026-02-21T08:53:32.2525959Z 'range_flattens': [None, None], 2026-02-21T08:53:32.2526270Z 'range_multi_buffers': [None, False], 2026-02-21T08:53:32.2526597Z 'range_num_stages': [0, 4], 2026-02-21T08:53:32.2526879Z 'range_unroll_factors': [0, 4], 2026-02-21T08:53:32.2527181Z 'range_warp_specializes': [], 2026-02-21T08:53:32.2527465Z 'waves_per_eu': 1} 2026-02-21T08:53:32.2741882Z [450s] Fitting surrogate: 1070 points, 1070 targets 2026-02-21T08:53:32.5678015Z [450s] Generation 17 starting: 16 neighbors, 1 active search path(s) 2026-02-21T08:53:36.1508996Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.5 configs/s 2026-02-21T08:53:37.2360110Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.1 configs/s 2026-02-21T08:53:39.0606077Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 462.0 2026-02-21T08:53:39.0606673Z configs/s 2026-02-21T08:53:39.5134835Z [457s] Generation 17 complete: 2026-02-21T08:53:39.5135127Z ok=17 2026-02-21T08:53:39.5135562Z min=0.1064 2026-02-21T08:53:39.5135795Z mid=0.1320 2026-02-21T08:53:39.5136456Z max=1.0106 2026-02-21T08:53:39.5136669Z best={'block_sizes': [64, 32, 16], 
2026-02-21T08:53:39.5137019Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:53:39.5137365Z 'l2_groupings': [8], 2026-02-21T08:53:39.5137607Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:39.5137885Z 'loop_orders': [[0, 1]], 2026-02-21T08:53:39.5138131Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:53:39.5138372Z 'num_stages': 3, 2026-02-21T08:53:39.5138570Z 'num_warps': 2, 2026-02-21T08:53:39.5138772Z 'pid_type': 'flat', 2026-02-21T08:53:39.5139000Z 'range_flattens': [None, None], 2026-02-21T08:53:39.5139269Z 'range_multi_buffers': [None, False], 2026-02-21T08:53:39.5139539Z 'range_num_stages': [0, 4], 2026-02-21T08:53:39.5139779Z 'range_unroll_factors': [0, 4], 2026-02-21T08:53:39.5140032Z 'range_warp_specializes': [], 2026-02-21T08:53:39.5140274Z 'waves_per_eu': 1} 2026-02-21T08:53:39.5328867Z [457s] Fitting surrogate: 1087 points, 1087 targets 2026-02-21T08:53:39.8288087Z [457s] Generation 18 starting: 18 neighbors, 1 active search path(s) 2026-02-21T08:53:43.2287313Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 15.0 configs/s 2026-02-21T08:53:44.4264304Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.1 configs/s 2026-02-21T08:53:46.0561434Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 513.1 2026-02-21T08:53:46.0561928Z configs/s 2026-02-21T08:53:46.5715120Z [464s] Generation 18 complete: 2026-02-21T08:53:46.5715486Z ok=19 2026-02-21T08:53:46.5716796Z min=0.1052 2026-02-21T08:53:46.5717122Z mid=0.1496 2026-02-21T08:53:46.5717342Z max=0.9927 2026-02-21T08:53:46.5717582Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:53:46.5718024Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:53:46.5718397Z 'l2_groupings': [8], 2026-02-21T08:53:46.5718676Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:46.5718991Z 'loop_orders': [[1, 0]], 2026-02-21T08:53:46.5719335Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:53:46.5719613Z 'num_stages': 3, 2026-02-21T08:53:46.5720317Z 
'num_warps': 2, 2026-02-21T08:53:46.5720449Z 'pid_type': 'flat', 2026-02-21T08:53:46.5720599Z 'range_flattens': [None, None], 2026-02-21T08:53:46.5720780Z 'range_multi_buffers': [None, False], 2026-02-21T08:53:46.5720953Z 'range_num_stages': [0, 4], 2026-02-21T08:53:46.5721124Z 'range_unroll_factors': [0, 4], 2026-02-21T08:53:46.5721299Z 'range_warp_specializes': [], 2026-02-21T08:53:46.5721456Z 'waves_per_eu': 1} 2026-02-21T08:53:46.5918845Z [464s] Fitting surrogate: 1106 points, 1106 targets 2026-02-21T08:53:46.8827236Z [465s] Generation 19 starting: 16 neighbors, 1 active search path(s) 2026-02-21T08:53:50.2426165Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.1 configs/s 2026-02-21T08:53:51.3177653Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s 2026-02-21T08:53:52.8203861Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 543.0 2026-02-21T08:53:52.8204352Z configs/s 2026-02-21T08:53:53.2699510Z [471s] Generation 19 complete: 2026-02-21T08:53:53.2699778Z ok=17 2026-02-21T08:53:53.2700819Z min=0.1052 2026-02-21T08:53:53.2700981Z mid=0.1450 2026-02-21T08:53:53.2701067Z max=0.9915 2026-02-21T08:53:53.2701163Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:53:53.2701358Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:53:53.2701504Z 'l2_groupings': [8], 2026-02-21T08:53:53.2701615Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:53.2701735Z 'loop_orders': [[1, 0]], 2026-02-21T08:53:53.2701842Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:53:53.2701951Z 'num_stages': 3, 2026-02-21T08:53:53.2702040Z 'num_warps': 2, 2026-02-21T08:53:53.2702131Z 'pid_type': 'flat', 2026-02-21T08:53:53.2702231Z 'range_flattens': [None, None], 2026-02-21T08:53:53.2702348Z 'range_multi_buffers': [None, True], 2026-02-21T08:53:53.2702906Z 'range_num_stages': [0, 4], 2026-02-21T08:53:53.2703011Z 'range_unroll_factors': [0, 4], 2026-02-21T08:53:53.2703143Z 'range_warp_specializes': [], 
2026-02-21T08:53:53.2703246Z 'waves_per_eu': 1} 2026-02-21T08:53:53.2908322Z [471s] Fitting surrogate: 1123 points, 1123 targets 2026-02-21T08:53:53.5588225Z [471s] Generation 20 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:54:25.7611975Z [503s] Timeout after 30s compiling Config(block_sizes=[128, 32, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:54:25.7631262Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.2 configs/s 2026-02-21T08:54:26.7450199Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 18.2 configs/s 2026-02-21T08:54:28.7073760Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 441.3 2026-02-21T08:54:28.7076516Z configs/s 2026-02-21T08:54:29.1516359Z [507s] Generation 20 complete: 2026-02-21T08:54:29.1516740Z timeout=1 2026-02-21T08:54:29.1516972Z ok=17 2026-02-21T08:54:29.1517162Z min=0.1051 2026-02-21T08:54:29.1517363Z mid=0.1188 2026-02-21T08:54:29.1517548Z max=0.1970 2026-02-21T08:54:29.1517757Z best={'block_sizes': [64, 32, 16], 2026-02-21T08:54:29.1518106Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T08:54:29.1518449Z 'l2_groupings': [8], 2026-02-21T08:54:29.1518703Z 'load_eviction_policies': ['', ''], 2026-02-21T08:54:29.1518999Z 'loop_orders': [[1, 0]], 2026-02-21T08:54:29.1519259Z 'matrix_instr_nonkdim': 16, 2026-02-21T08:54:29.1519514Z 'num_stages': 2, 2026-02-21T08:54:29.1519723Z 'num_warps': 2, 2026-02-21T08:54:29.1519935Z 'pid_type': 'flat', 2026-02-21T08:54:29.1520226Z 'range_flattens': [None, None], 2026-02-21T08:54:29.1520508Z 'range_multi_buffers': [None, True], 2026-02-21T08:54:29.1521276Z 'range_num_stages': [0, 4], 
2026-02-21T08:54:29.1521540Z 'range_unroll_factors': [0, 4], 2026-02-21T08:54:29.1521817Z 'range_warp_specializes': [], 2026-02-21T08:54:29.1522075Z 'waves_per_eu': 1} 2026-02-21T08:54:29.1745113Z [507s] Fitting surrogate: 1141 points, 1141 targets 2026-02-21T08:54:29.3091733Z [507s] Autotuning complete in 507.5s after searching 1088 configs. 2026-02-21T08:54:29.3092305Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:54:29.3094185Z @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T08:54:29.3095350Z 2026-02-21T08:54:29.3095525Z [507s] Code of selected kernel: /tmp/torchinductor_root/hp/chpk4kwpoinvb37woomythwzhrrcj56jbyn2fjphaldipw3fj6kj.py 2026-02-21T08:54:30.2901284Z WARNING:tritonbench.utils.triton_op:Completed input ID 14: 2026-02-21T08:54:30.2901759Z x_val 2026-02-21T08:54:30.2901987Z ------------------- 2026-02-21T08:54:30.2902240Z (64, 1, 7168, 8192) 2026-02-21T08:54:30.2902383Z 2026-02-21T08:54:30.2930878Z 50%|█████ | 5/10 [45:09<43:42, 524.56s/it]WARNING:tritonbench.utils.triton_op:Running input ID 17: 2026-02-21T08:54:30.2931335Z x_val 2026-02-21T08:54:30.2931523Z --------------------- 2026-02-21T08:54:30.2931744Z (1, 4096, 8192, 1024) 2026-02-21T08:54:30.2935512Z INFO:tritonbench.utils.triton_op:Took 0.35ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:54:31.3312289Z INFO:tritonbench.utils.triton_op:Took 3.92ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:54:32.9403532Z Autotune Choices Stats: 2026-02-21T08:54:32.9404346Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": 
"mm", "best_time": 0.10955700278282166, "best_triton_pos": 1, "best_triton_time": 0.11027800291776657, "best_triton_kernel": "triton_mm_119", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"} 2026-02-21T08:54:32.9529966Z AUTOTUNE mm(4096x1024, 1024x8192) 2026-02-21T08:54:32.9530184Z strides: [1024, 1], [8192, 1] 2026-02-21T08:54:32.9530358Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:54:32.9530526Z mm 0.1096 ms 100.0% 2026-02-21T08:54:32.9531077Z triton_mm_119 0.1103 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:54:32.9531985Z triton_mm_116 0.1117 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:54:32.9532864Z triton_mm_113 0.1225 ms 89.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:54:32.9533733Z triton_mm_114 0.1323 ms 82.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:54:32.9534599Z triton_mm_118 0.1380 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:54:32.9535462Z triton_mm_107 0.1392 ms 78.7% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:54:32.9536828Z triton_mm_110 0.1423 ms 77.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:54:32.9562386Z triton_mm_112 0.1536 ms 71.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:54:32.9563511Z triton_mm_117 0.1559 ms 70.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:54:32.9564181Z SingleProcess AUTOTUNE benchmarking takes 1.1024 seconds and 0.3124 seconds precompiling for 37 choices 2026-02-21T08:54:35.4036396Z INFO:tritonbench.utils.triton_op:Took 0.16ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:54:35.4057833Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:54:35.4058185Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:54:35.4058465Z 'dtype': 'torch.bfloat16', 2026-02-21T08:54:35.4058726Z 'shape': (1, 4096, 1024), 2026-02-21T08:54:35.4058982Z 'stride': (4194304, 1024, 1)}, 2026-02-21T08:54:35.4059236Z { 'device': 'cuda:0', 2026-02-21T08:54:35.4059483Z 'dtype': 'torch.int32', 2026-02-21T08:54:35.4059732Z 'shape': (1024, 8192), 2026-02-21T08:54:35.4059974Z 'stride': (8192, 1)}), 2026-02-21T08:54:35.4060206Z 'kwargs': {}} 2026-02-21T08:54:35.4104280Z INFO:tritonbench.utils.triton_op:Took 4.64ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:54:35.6033281Z [0s] Autotune random seed: 2134834638 
2026-02-21T08:54:35.6923862Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:55:12.8272948Z [37s] Timeout after 30s compiling Config(block_sizes=[128, 2, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:55:12.8288791Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T08:55:14.0892938Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:55:14.0921687Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:55:14.0925209Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:55:14.0925794Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:55:14.0926247Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T08:55:14.0926663Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:55:14.0926958Z #smem = #ttg.shared_memory 2026-02-21T08:55:14.0927321Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:55:14.0928098Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr 
{tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:55:14.0929196Z %cst = arith.constant dense<0.000000e+00> : tensor<64x16xf32, #mma> 2026-02-21T08:55:14.0929448Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:55:14.0929625Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:55:14.0929793Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:55:14.0930012Z %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0930245Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:55:14.0930442Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:55:14.0930599Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:55:14.0930820Z %c510_i32 = arith.constant 510 : i32 2026-02-21T08:55:14.0930978Z %c6_i32 = arith.constant 6 : i32 2026-02-21T08:55:14.0931129Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:55:14.0931432Z %cst_1 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0931834Z %cst_2 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0932185Z %cst_3 = arith.constant dense<0> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0932524Z %cst_4 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0932860Z %cst_5 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0933197Z %cst_6 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0933532Z %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0934066Z %cst_8 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0934358Z %cst_9 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 
2026-02-21T08:55:14.0934651Z %cst_10 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0934947Z %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.0935170Z %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.0935408Z %cst_13 = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T08:55:14.0935602Z %0 = tt.get_program_id x : i32 2026-02-21T08:55:14.0935751Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T08:55:14.0935912Z %2 = arith.minsi %1, %c32768_i32 : i32 2026-02-21T08:55:14.0936181Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.0936550Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.0936962Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0937385Z %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.0937736Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0938066Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.0938385Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0938800Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0939376Z %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0940122Z %12 = arith.extsi %5 : 
tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0940679Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:55:14.0941106Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:55:14.0941536Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.0941806Z %16 = arith.cmpi eq, %15, %cst_11 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.0942015Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 2026-02-21T08:55:14.0942225Z %18 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.0942422Z %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 2026-02-21T08:55:14.0942639Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<64x16x!tt.ptr, #mma> 2026-02-21T08:55:14.0942811Z %21 = arith.subi %2, %0 : i32 2026-02-21T08:55:14.0942934Z %22 = arith.remsi %21, %c2_i32 : i32 2026-02-21T08:55:14.0943055Z %23 = arith.subi %21, %22 : i32 2026-02-21T08:55:14.0943168Z %24 = arith.addi %0, %23 : i32 2026-02-21T08:55:14.0943342Z %25 = arith.addi %7, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0943629Z %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.0943951Z %27 = tt.broadcast %26 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0944234Z %28 = arith.addi %11, %cst_1 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, 
parent = #blocked}>}>> 2026-02-21T08:55:14.0944649Z %29 = tt.expand_dims %28 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0945019Z %30 = arith.muli %29, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0945346Z %31 = tt.broadcast %30 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0945677Z %32 = arith.cmpi sge, %29, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0945931Z %33 = arith.cmpi slt, %29, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0946173Z %34 = arith.andi %32, %33 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0946488Z %35 = tt.broadcast %34 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0946761Z scf.for %arg3 = %0 to %24 step %c2_i32 : i32 { 2026-02-21T08:55:14.0946909Z %36 = arith.remsi %arg3, %c64_i32 : i32 2026-02-21T08:55:14.0947042Z %37 = arith.divsi %arg3, %c64_i32 : i32 2026-02-21T08:55:14.0947168Z %38 = arith.muli %36, %c64_i32 : i32 2026-02-21T08:55:14.0947350Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.0947570Z %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.0947791Z %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.0948018Z %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.0948189Z %43 = arith.muli %37, %c16_i32 : i32 2026-02-21T08:55:14.0948389Z %44 = tt.splat %43 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T08:55:14.0948599Z %45 = arith.addi %44, %6 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.0948891Z %46 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T08:55:14.0949151Z %47 = arith.muli %46, %cst_9 : tensor<64x1xi32, #blocked1> 2026-02-21T08:55:14.0949356Z %48 = tt.broadcast %47 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0949545Z %49 = arith.extsi %43 : i32 to i64 2026-02-21T08:55:14.0949756Z %50 = tt.splat %49 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0950072Z %51 = arith.addi %50, %12 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0950461Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0950889Z %53 = tt.broadcast %52 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0951199Z %54 = arith.cmpi sge, %52, %cst_5 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0951435Z %55 = arith.cmpi slt, %52, %cst_4 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0951663Z %56 = arith.andi %54, %55 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0951994Z %57 = tt.broadcast %56 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0952333Z %58 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<64x16xf32, #mma>) : i32 { 2026-02-21T08:55:14.0952551Z %145 = arith.muli %arg4, %c2_i32 : i32 
2026-02-21T08:55:14.0952749Z %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0952971Z %147 = arith.addi %146, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0953249Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.0953528Z %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0953722Z %150 = arith.addi %48, %149 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0953934Z %151 = tt.addptr %8, %150 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0954140Z %152 = tt.load %151 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.0954360Z %153 = ttg.local_alloc %152 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.0954689Z %154 = ttg.local_load %153 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0955096Z %155 = arith.extf %154 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0955378Z %156 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:55:14.0955588Z %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0955891Z %158 = arith.addi %157, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0956283Z %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0956686Z %160 = arith.muli %159, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T08:55:14.0956997Z %161 = tt.broadcast %160 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0957303Z %162 = arith.addi %161, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0957614Z %163 = tt.addptr %9, %162 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0957932Z %164 = arith.cmpi sge, %159, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0958175Z %165 = arith.cmpi slt, %159, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0958411Z %166 = arith.andi %164, %165 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0958713Z %167 = tt.broadcast %166 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0959008Z %168 = arith.andi %167, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0959253Z %169 = tt.load %163, %168, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0959498Z %170 = arith.shli %169, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0959734Z %171 = arith.shrsi %170, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0959971Z %172 = arith.shrsi %169, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0960292Z %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.0960633Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.0960914Z %175 
= tt.broadcast %173 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0961155Z %176 = arith.select %17, %175, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0961392Z %177 = tt.broadcast %174 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0961622Z %178 = arith.select %19, %177, %176 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0961852Z %179 = tt.reshape %178 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.0962076Z %180 = arith.sitofp %179 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.0962370Z %181 = ttg.convert_layout %180 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0962925Z %182 = tt.dot %155, %181, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.0963283Z %183 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:55:14.0963411Z %184 = arith.muli %183, %c2_i32 : i32 2026-02-21T08:55:14.0963582Z %185 = tt.splat %184 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0963811Z %186 = arith.addi %185, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0964094Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.0964369Z %188 = tt.broadcast %187 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0964605Z %189 = arith.addi %48, %188 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0964800Z %190 = tt.addptr %8, %189 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0965007Z %191 = tt.load %190 : 
tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.0965228Z %192 = ttg.local_alloc %191 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.0965552Z %193 = ttg.local_load %192 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0965956Z %194 = arith.extf %193 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0966234Z %195 = arith.extsi %183 : i32 to i64 2026-02-21T08:55:14.0966451Z %196 = tt.splat %195 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0966756Z %197 = arith.addi %196, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0967141Z %198 = tt.expand_dims %197 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0967497Z %199 = arith.muli %198, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0967803Z %200 = tt.broadcast %199 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0968147Z %201 = arith.addi %200, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0968454Z %202 = tt.addptr %9, %201 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0968772Z %203 = arith.cmpi sge, %198, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0969014Z %204 = arith.cmpi slt, %198, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0969251Z %205 = 
arith.andi %203, %204 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0969546Z %206 = tt.broadcast %205 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0969843Z %207 = arith.andi %206, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0970089Z %208 = tt.load %202, %207, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0970334Z %209 = arith.shli %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0970573Z %210 = arith.shrsi %209, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0970808Z %211 = arith.shrsi %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0971098Z %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.0971433Z %213 = tt.expand_dims %211 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.0971712Z %214 = tt.broadcast %212 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0971952Z %215 = arith.select %17, %214, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0972185Z %216 = tt.broadcast %213 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0972461Z %217 = arith.select %19, %216, %215 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0972688Z %218 = tt.reshape %217 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.0972907Z %219 = arith.sitofp %218 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.0973199Z %220 = ttg.convert_layout %219 : tensor<4x16xf32, #blocked2> -> 
tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0973658Z %221 = tt.dot %194, %220, %182, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.0974001Z %222 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T08:55:14.0974126Z %223 = arith.muli %222, %c2_i32 : i32 2026-02-21T08:55:14.0974295Z %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0974521Z %225 = arith.addi %224, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.0974793Z %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.0975071Z %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0975268Z %228 = arith.addi %48, %227 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0975467Z %229 = tt.addptr %8, %228 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.0975674Z %230 = tt.load %229 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.0975925Z %231 = ttg.local_alloc %230 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.0976251Z %232 = ttg.local_load %231 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0976657Z %233 = arith.extf %232 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.0976933Z %234 = arith.extsi %222 : i32 to i64 2026-02-21T08:55:14.0977146Z %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0977441Z 
%236 = arith.addi %235, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.0977842Z %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0978200Z %238 = arith.muli %237, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0978506Z %239 = tt.broadcast %238 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0978810Z %240 = arith.addi %239, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0979117Z %241 = tt.addptr %9, %240 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0979431Z %242 = arith.cmpi sge, %237, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0979673Z %243 = arith.cmpi slt, %237, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0979909Z %244 = arith.andi %242, %243 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0980208Z %245 = tt.broadcast %244 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0997869Z %246 = arith.andi %245, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0998111Z %247 = tt.load %241, %246, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0998364Z %248 = arith.shli %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0998601Z %249 = arith.shrsi %248, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0998840Z %250 
= arith.shrsi %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.0999132Z %251 = tt.expand_dims %249 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.0999470Z %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.0999758Z %253 = tt.broadcast %251 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.0999996Z %254 = arith.select %17, %253, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1000230Z %255 = tt.broadcast %252 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1000463Z %256 = arith.select %19, %255, %254 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1000691Z %257 = tt.reshape %256 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1000915Z %258 = arith.sitofp %257 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1001259Z %259 = ttg.convert_layout %258 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1001720Z %260 = tt.dot %233, %259, %221, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1002068Z scf.yield %260 : tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1002197Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:55:14.1002336Z %59 = arith.addi %48, %27 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1002530Z %60 = tt.addptr %8, %59 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1002777Z %61 = tt.load %60 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1002993Z %62 = 
ttg.local_alloc %61 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1003314Z %63 = ttg.local_load %62 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1003713Z %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1004043Z %65 = arith.addi %31, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1004342Z %66 = tt.addptr %9, %65 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1004643Z %67 = arith.andi %35, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1004876Z %68 = tt.load %66, %67, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1005114Z %69 = arith.shli %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1005349Z %70 = arith.shrsi %69, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1005578Z %71 = arith.shrsi %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1005915Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1006307Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1006580Z %74 = tt.broadcast %72 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1006812Z %75 = arith.select %17, %74, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1007036Z %76 = tt.broadcast %73 : tensor<2x1x16xi8, #blocked> 
-> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1007260Z %77 = arith.select %19, %76, %75 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1007475Z %78 = tt.reshape %77 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1007690Z %79 = arith.sitofp %78 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1007978Z %80 = ttg.convert_layout %79 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1008426Z %81 = tt.dot %64, %80, %58, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1008799Z %82 = arith.truncf %81 : tensor<64x16xf32, #mma> to tensor<64x16xbf16, #mma> 2026-02-21T08:55:14.1009056Z %83 = tt.expand_dims %42 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T08:55:14.1009319Z %84 = arith.muli %83, %cst_13 : tensor<64x1xi32, #mma> 2026-02-21T08:55:14.1009542Z %85 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma> 2026-02-21T08:55:14.1009788Z %86 = tt.broadcast %84 : tensor<64x1xi32, #mma> -> tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1009981Z %87 = tt.broadcast %85 : tensor<1x16xi32, #mma> -> tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1010152Z %88 = arith.addi %86, %87 : tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1010330Z %89 = tt.addptr %20, %88 : tensor<64x16x!tt.ptr, #mma>, tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1010518Z tt.store %89, %82 : tensor<64x16x!tt.ptr, #mma> 2026-02-21T08:55:14.1010654Z %90 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T08:55:14.1010775Z %91 = arith.remsi %90, %c64_i32 : i32 2026-02-21T08:55:14.1010888Z %92 = arith.divsi %90, %c64_i32 : i32 2026-02-21T08:55:14.1011005Z %93 = 
arith.muli %91, %c64_i32 : i32 2026-02-21T08:55:14.1011175Z %94 = tt.splat %93 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1011386Z %95 = tt.splat %93 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.1011598Z %96 = arith.addi %94, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1011805Z %97 = arith.addi %95, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.1011965Z %98 = arith.muli %92, %c16_i32 : i32 2026-02-21T08:55:14.1012118Z %99 = tt.splat %98 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.1012321Z %100 = arith.addi %99, %6 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.1012592Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T08:55:14.1012844Z %102 = arith.muli %101, %cst_9 : tensor<64x1xi32, #blocked1> 2026-02-21T08:55:14.1013043Z %103 = tt.broadcast %102 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1013215Z %104 = arith.extsi %98 : i32 to i64 2026-02-21T08:55:14.1013459Z %105 = tt.splat %104 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1013758Z %106 = arith.addi %105, %12 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1014149Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1014575Z %108 = tt.broadcast %107 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1014885Z %109 = arith.cmpi sge, %107, %cst_5 : tensor<1x16xi64, #ttg.slice<{dim = 
1, parent = #blocked}>> 2026-02-21T08:55:14.1015131Z %110 = arith.cmpi slt, %107, %cst_4 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1015370Z %111 = arith.andi %109, %110 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1015669Z %112 = tt.broadcast %111 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1016014Z %113 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<64x16xf32, #mma>) : i32 { 2026-02-21T08:55:14.1016228Z %145 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:55:14.1016401Z %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1016626Z %147 = arith.addi %146, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1016944Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.1017219Z %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1017416Z %150 = arith.addi %103, %149 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1017615Z %151 = tt.addptr %8, %150 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1017819Z %152 = tt.load %151 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1018038Z %153 = ttg.local_alloc %152 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1018368Z %154 = ttg.local_load %153 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1018774Z %155 = arith.extf %154 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1019058Z %156 = 
arith.extsi %arg4 : i32 to i64 2026-02-21T08:55:14.1019268Z %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1019565Z %158 = arith.addi %157, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1019953Z %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1020311Z %160 = arith.muli %159, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1020619Z %161 = tt.broadcast %160 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1020927Z %162 = arith.addi %161, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1021233Z %163 = tt.addptr %9, %162 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1021580Z %164 = arith.cmpi sge, %159, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1021826Z %165 = arith.cmpi slt, %159, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1022057Z %166 = arith.andi %164, %165 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1022359Z %167 = tt.broadcast %166 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1022659Z %168 = arith.andi %167, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1022900Z %169 = tt.load %163, %168, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1023147Z %170 = arith.shli %169, %cst_10 : 
tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1023384Z %171 = arith.shrsi %170, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1023620Z %172 = arith.shrsi %169, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1023912Z %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1024248Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1024532Z %175 = tt.broadcast %173 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1024768Z %176 = arith.select %17, %175, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1025037Z %177 = tt.broadcast %174 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1025271Z %178 = arith.select %19, %177, %176 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1025498Z %179 = tt.reshape %178 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1025719Z %180 = arith.sitofp %179 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1026011Z %181 = ttg.convert_layout %180 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1026477Z %182 = tt.dot %155, %181, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1026827Z %183 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:55:14.1026952Z %184 = arith.muli %183, %c2_i32 : i32 2026-02-21T08:55:14.1027123Z %185 = tt.splat %184 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T08:55:14.1027346Z %186 = arith.addi %185, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1027621Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.1027898Z %188 = tt.broadcast %187 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1028089Z %189 = arith.addi %103, %188 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1028287Z %190 = tt.addptr %8, %189 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1028488Z %191 = tt.load %190 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1028711Z %192 = ttg.local_alloc %191 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1029038Z %193 = ttg.local_load %192 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1029468Z %194 = arith.extf %193 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1029747Z %195 = arith.extsi %183 : i32 to i64 2026-02-21T08:55:14.1029953Z %196 = tt.splat %195 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1030254Z %197 = arith.addi %196, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1030656Z %198 = tt.expand_dims %197 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1031009Z %199 = arith.muli %198, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1031319Z %200 = tt.broadcast %199 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent 
= #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1031623Z %201 = arith.addi %200, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1031930Z %202 = tt.addptr %9, %201 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1032245Z %203 = arith.cmpi sge, %198, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1032487Z %204 = arith.cmpi slt, %198, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1032759Z %205 = arith.andi %203, %204 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1033060Z %206 = tt.broadcast %205 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1033357Z %207 = arith.andi %206, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1033604Z %208 = tt.load %202, %207, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1033849Z %209 = arith.shli %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1034083Z %210 = arith.shrsi %209, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1034320Z %211 = arith.shrsi %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1034604Z %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1034939Z %213 = tt.expand_dims %211 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1035224Z %214 = tt.broadcast %212 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1035457Z 
%215 = arith.select %17, %214, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1035691Z %216 = tt.broadcast %213 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1035920Z %217 = arith.select %19, %216, %215 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1036147Z %218 = tt.reshape %217 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1036367Z %219 = arith.sitofp %218 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1036658Z %220 = ttg.convert_layout %219 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1037124Z %221 = tt.dot %194, %220, %182, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1037503Z %222 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T08:55:14.1037628Z %223 = arith.muli %222, %c2_i32 : i32 2026-02-21T08:55:14.1037799Z %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1038018Z %225 = arith.addi %224, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1038296Z %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.1038570Z %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1038765Z %228 = arith.addi %103, %227 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1038967Z %229 = tt.addptr %8, %228 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1039169Z %230 = tt.load %229 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1039389Z %231 = ttg.local_alloc %230 : (tensor<64x4xbf16, #blocked1>) 
-> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1039711Z %232 = ttg.local_load %231 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1040118Z %233 = arith.extf %232 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1040397Z %234 = arith.extsi %222 : i32 to i64 2026-02-21T08:55:14.1040633Z %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1040932Z %236 = arith.addi %235, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1041319Z %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1041678Z %238 = arith.muli %237, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1041986Z %239 = tt.broadcast %238 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1042287Z %240 = arith.addi %239, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1042648Z %241 = tt.addptr %9, %240 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1042966Z %242 = arith.cmpi sge, %237, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1043207Z %243 = arith.cmpi slt, %237, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1043443Z %244 = arith.andi %242, %243 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1043747Z %245 = 
tt.broadcast %244 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1044050Z %246 = arith.andi %245, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1044298Z %247 = tt.load %241, %246, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1044547Z %248 = arith.shli %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1044788Z %249 = arith.shrsi %248, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1045061Z %250 = arith.shrsi %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1045349Z %251 = tt.expand_dims %249 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1045682Z %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1045965Z %253 = tt.broadcast %251 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1046199Z %254 = arith.select %17, %253, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1046433Z %255 = tt.broadcast %252 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1046664Z %256 = arith.select %19, %255, %254 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1046892Z %257 = tt.reshape %256 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1047115Z %258 = arith.sitofp %257 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1047411Z %259 = ttg.convert_layout %258 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1047873Z %260 = tt.dot %233, 
%259, %221, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1048214Z scf.yield %260 : tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1048341Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:55:14.1048477Z %114 = arith.addi %103, %27 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1048709Z %115 = tt.addptr %8, %114 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1048914Z %116 = tt.load %115 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1049141Z %117 = ttg.local_alloc %116 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1049465Z %118 = ttg.local_load %117 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1049867Z %119 = arith.extf %118 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1050199Z %120 = arith.addi %31, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1050511Z %121 = tt.addptr %9, %120 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1050811Z %122 = arith.andi %35, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1051053Z %123 = tt.load %121, %122, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1051296Z %124 = arith.shli %123, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1051534Z %125 = arith.shrsi %124, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1051770Z %126 = arith.shrsi %123, %cst_10 : tensor<2x16xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1052055Z %127 = tt.expand_dims %125 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1052389Z %128 = tt.expand_dims %126 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1052670Z %129 = tt.broadcast %127 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1052940Z %130 = arith.select %17, %129, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1053177Z %131 = tt.broadcast %128 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1053404Z %132 = arith.select %19, %131, %130 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1053634Z %133 = tt.reshape %132 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1053851Z %134 = arith.sitofp %133 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1054142Z %135 = ttg.convert_layout %134 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1054598Z %136 = tt.dot %119, %135, %113, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1054976Z %137 = arith.truncf %136 : tensor<64x16xf32, #mma> to tensor<64x16xbf16, #mma> 2026-02-21T08:55:14.1055240Z %138 = tt.expand_dims %97 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T08:55:14.1055478Z %139 = arith.muli %138, %cst_13 : tensor<64x1xi32, #mma> 2026-02-21T08:55:14.1055705Z %140 = tt.expand_dims %100 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma> 
2026-02-21T08:55:14.1055960Z %141 = tt.broadcast %139 : tensor<64x1xi32, #mma> -> tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1056157Z %142 = tt.broadcast %140 : tensor<1x16xi32, #mma> -> tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1056337Z %143 = arith.addi %141, %142 : tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1056556Z %144 = tt.addptr %20, %143 : tensor<64x16x!tt.ptr, #mma>, tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1056747Z tt.store %144, %137 : tensor<64x16x!tt.ptr, #mma> 2026-02-21T08:55:14.1056889Z } {tt.num_stages = 1 : i32} 2026-02-21T08:55:14.1057008Z scf.for %arg3 = %24 to %2 step %c1_i32 : i32 { 2026-02-21T08:55:14.1057143Z %36 = arith.remsi %arg3, %c64_i32 : i32 2026-02-21T08:55:14.1057265Z %37 = arith.divsi %arg3, %c64_i32 : i32 2026-02-21T08:55:14.1057386Z %38 = arith.muli %36, %c64_i32 : i32 2026-02-21T08:55:14.1057548Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1057758Z %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.1057969Z %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1058174Z %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.1058340Z %43 = arith.muli %37, %c16_i32 : i32 2026-02-21T08:55:14.1058494Z %44 = tt.splat %43 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.1058695Z %45 = arith.addi %44, %6 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.1058959Z %46 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T08:55:14.1059202Z %47 = arith.muli %46, %cst_9 : tensor<64x1xi32, #blocked1> 2026-02-21T08:55:14.1059395Z %48 = tt.broadcast %47 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1059567Z %49 = arith.extsi %43 : i32 to i64 
2026-02-21T08:55:14.1059775Z %50 = tt.splat %49 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1060067Z %51 = arith.addi %50, %12 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1060447Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1060904Z %53 = tt.broadcast %52 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1061209Z %54 = arith.cmpi sge, %52, %cst_5 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1061448Z %55 = arith.cmpi slt, %52, %cst_4 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1061675Z %56 = arith.andi %54, %55 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1061966Z %57 = tt.broadcast %56 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1062298Z %58 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<64x16xf32, #mma>) : i32 { 2026-02-21T08:55:14.1062511Z %90 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:55:14.1062678Z %91 = tt.splat %90 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1062894Z %92 = arith.addi %91, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1063163Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.1063434Z %94 = tt.broadcast %93 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1063620Z %95 = arith.addi %48, %94 : 
tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1063813Z %96 = tt.addptr %8, %95 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1064011Z %97 = tt.load %96 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1064263Z %98 = ttg.local_alloc %97 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1064586Z %99 = ttg.local_load %98 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1064983Z %100 = arith.extf %99 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1065262Z %101 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:55:14.1065472Z %102 = tt.splat %101 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1065767Z %103 = arith.addi %102, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1066159Z %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1066519Z %105 = arith.muli %104, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1066829Z %106 = tt.broadcast %105 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1067138Z %107 = arith.addi %106, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1067448Z %108 = tt.addptr %9, %107 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1067769Z %109 = arith.cmpi sge, %104, %cst_7 : tensor<2x1xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1068016Z %110 = arith.cmpi slt, %104, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1068254Z %111 = arith.andi %109, %110 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1068587Z %112 = tt.broadcast %111 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1068884Z %113 = arith.andi %112, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1069124Z %114 = tt.load %108, %113, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1069371Z %115 = arith.shli %114, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1069608Z %116 = arith.shrsi %115, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1069847Z %117 = arith.shrsi %114, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1070137Z %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1070474Z %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1070757Z %120 = tt.broadcast %118 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1070993Z %121 = arith.select %17, %120, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1071229Z %122 = tt.broadcast %119 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1071457Z %123 = arith.select %19, %122, %121 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1071683Z %124 = tt.reshape %123 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 
2026-02-21T08:55:14.1071903Z %125 = arith.sitofp %124 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1072224Z %126 = ttg.convert_layout %125 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1072687Z %127 = tt.dot %100, %126, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1073033Z %128 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:55:14.1073153Z %129 = arith.muli %128, %c2_i32 : i32 2026-02-21T08:55:14.1073323Z %130 = tt.splat %129 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1073546Z %131 = arith.addi %130, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1073823Z %132 = tt.expand_dims %131 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.1074110Z %133 = tt.broadcast %132 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1074305Z %134 = arith.addi %48, %133 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1074502Z %135 = tt.addptr %8, %134 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1074705Z %136 = tt.load %135 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1074925Z %137 = ttg.local_alloc %136 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1075248Z %138 = ttg.local_load %137 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1075647Z %139 = arith.extf %138 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1075929Z %140 
= arith.extsi %128 : i32 to i64 2026-02-21T08:55:14.1076138Z %141 = tt.splat %140 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1076467Z %142 = arith.addi %141, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1076855Z %143 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1077206Z %144 = arith.muli %143, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1077512Z %145 = tt.broadcast %144 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1077814Z %146 = arith.addi %145, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1078119Z %147 = tt.addptr %9, %146 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1078435Z %148 = arith.cmpi sge, %143, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1078677Z %149 = arith.cmpi slt, %143, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1078913Z %150 = arith.andi %148, %149 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1079215Z %151 = tt.broadcast %150 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1079508Z %152 = arith.andi %151, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1079753Z %153 = tt.load %147, %152, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1080027Z %154 = arith.shli %153, %cst_10 : 
tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1080261Z %155 = arith.shrsi %154, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1080501Z %156 = arith.shrsi %153, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1080792Z %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1081129Z %158 = tt.expand_dims %156 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1081412Z %159 = tt.broadcast %157 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1081650Z %160 = arith.select %17, %159, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1081892Z %161 = tt.broadcast %158 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1082119Z %162 = arith.select %19, %161, %160 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1082352Z %163 = tt.reshape %162 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1082623Z %164 = arith.sitofp %163 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1082923Z %165 = ttg.convert_layout %164 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1083387Z %166 = tt.dot %139, %165, %127, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1095532Z %167 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T08:55:14.1095668Z %168 = arith.muli %167, %c2_i32 : i32 2026-02-21T08:55:14.1095856Z %169 = tt.splat %168 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T08:55:14.1096175Z %170 = arith.addi %169, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1096456Z %171 = tt.expand_dims %170 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T08:55:14.1096739Z %172 = tt.broadcast %171 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1096947Z %173 = arith.addi %48, %172 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1097154Z %174 = tt.addptr %8, %173 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1097367Z %175 = tt.load %174 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1097591Z %176 = ttg.local_alloc %175 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1097927Z %177 = ttg.local_load %176 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1098336Z %178 = arith.extf %177 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1098623Z %179 = arith.extsi %167 : i32 to i64 2026-02-21T08:55:14.1098838Z %180 = tt.splat %179 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1099137Z %181 = arith.addi %180, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:55:14.1099534Z %182 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1099934Z %183 = arith.muli %182, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1100244Z %184 = tt.broadcast %183 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent 
= #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1100554Z %185 = arith.addi %184, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1100862Z %186 = tt.addptr %9, %185 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1101182Z %187 = arith.cmpi sge, %182, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1101430Z %188 = arith.cmpi slt, %182, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1101666Z %189 = arith.andi %187, %188 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1101977Z %190 = tt.broadcast %189 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1102277Z %191 = arith.andi %190, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1102522Z %192 = tt.load %186, %191, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1102771Z %193 = arith.shli %192, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1103006Z %194 = arith.shrsi %193, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1103244Z %195 = arith.shrsi %192, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1103532Z %196 = tt.expand_dims %194 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1103871Z %197 = tt.expand_dims %195 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1104189Z %198 = tt.broadcast %196 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1104425Z 
%199 = arith.select %17, %198, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1104665Z %200 = tt.broadcast %197 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1104895Z %201 = arith.select %19, %200, %199 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1105127Z %202 = tt.reshape %201 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1105353Z %203 = arith.sitofp %202 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1105654Z %204 = ttg.convert_layout %203 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1106130Z %205 = tt.dot %178, %204, %166, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1106485Z scf.yield %205 : tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1106615Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:55:14.1106758Z %59 = arith.addi %48, %27 : tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1106953Z %60 = tt.addptr %8, %59 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T08:55:14.1107156Z %61 = tt.load %60 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1107373Z %62 = ttg.local_alloc %61 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T08:55:14.1107694Z %63 = ttg.local_load %62 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1108127Z %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1108453Z %65 = arith.addi %31, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T08:55:14.1108757Z %66 = tt.addptr %9, %65 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1109060Z %67 = arith.andi %35, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1109295Z %68 = tt.load %66, %67, %cst_3 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1109538Z %69 = arith.shli %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1109765Z %70 = arith.shrsi %69, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1110004Z %71 = arith.shrsi %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1110298Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1110624Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T08:55:14.1110898Z %74 = tt.broadcast %72 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1111129Z %75 = arith.select %17, %74, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1111362Z %76 = tt.broadcast %73 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1111589Z %77 = arith.select %19, %76, %75 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T08:55:14.1111811Z %78 = tt.reshape %77 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T08:55:14.1112027Z %79 = arith.sitofp %78 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T08:55:14.1112347Z %80 = ttg.convert_layout %79 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1112804Z %81 
= tt.dot %64, %80, %58, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma> 2026-02-21T08:55:14.1113180Z %82 = arith.truncf %81 : tensor<64x16xf32, #mma> to tensor<64x16xbf16, #mma> 2026-02-21T08:55:14.1113434Z %83 = tt.expand_dims %42 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T08:55:14.1113667Z %84 = arith.muli %83, %cst_13 : tensor<64x1xi32, #mma> 2026-02-21T08:55:14.1113893Z %85 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma> 2026-02-21T08:55:14.1114143Z %86 = tt.broadcast %84 : tensor<64x1xi32, #mma> -> tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1114343Z %87 = tt.broadcast %85 : tensor<1x16xi32, #mma> -> tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1114516Z %88 = arith.addi %86, %87 : tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1114697Z %89 = tt.addptr %20, %88 : tensor<64x16x!tt.ptr, #mma>, tensor<64x16xi32, #mma> 2026-02-21T08:55:14.1114885Z tt.store %89, %82 : tensor<64x16x!tt.ptr, #mma> 2026-02-21T08:55:14.1115024Z } {tt.num_stages = 1 : i32} 2026-02-21T08:55:14.1115131Z tt.return 2026-02-21T08:55:14.1115212Z } 2026-02-21T08:55:14.1115290Z } 2026-02-21T08:55:14.1115334Z 2026-02-21T08:55:14.1115366Z {-# 2026-02-21T08:55:14.1115448Z external_resources: { 2026-02-21T08:55:14.1115548Z mlir_reproducer: { 2026-02-21T08:55:14.1116579Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:55:14.1117588Z disable_threading: false, 2026-02-21T08:55:14.1117693Z verify_each: true 2026-02-21T08:55:14.1117783Z } 2026-02-21T08:55:14.1117853Z } 2026-02-21T08:55:14.1117923Z #-} 2026-02-21T08:55:14.1118202Z /tmp/torchinductor_root/sk/cskbzeetrf2svkrdmuxrjrtonuici6bh4yb2dhzv47xc7dtiicyl.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:55:14.1118950Z /tmp/torchinductor_root/sk/cskbzeetrf2svkrdmuxrjrtonuici6bh4yb2dhzv47xc7dtiicyl.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:55:14.1119506Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:55:14.1120283Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T08:55:14.1120993Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:55:14.1121164Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:55:14.1461058Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:55:14.1465497Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:55:14.1465870Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:55:14.1466179Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:55:14.1466475Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:55:14.1466756Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:55:14.1467015Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:55:14.1467205Z #smem = #ttg.shared_memory 2026-02-21T08:55:14.1467444Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:55:14.1467939Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:55:14.1468320Z %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma> 2026-02-21T08:55:14.1468498Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.1468673Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.1468856Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T08:55:14.1469121Z %cst_3 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T08:55:14.1469299Z %cst_4 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2> 2026-02-21T08:55:14.1469452Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:55:14.1469570Z %c16_i32 = arith.constant 16 : i32 
2026-02-21T08:55:14.1469684Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:55:14.1469802Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:55:14.1469916Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:55:14.1470058Z %cst_5 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1470209Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:55:14.1470321Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:55:14.1470506Z %cst_6 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1470693Z %0 = tt.get_program_id x : i32 2026-02-21T08:55:14.1470846Z %1 = arith.remsi %0, %c32_i32 : i32 2026-02-21T08:55:14.1470969Z %2 = arith.divsi %0, %c32_i32 : i32 2026-02-21T08:55:14.1471080Z %3 = arith.muli %1, %c256_i32 : i32 2026-02-21T08:55:14.1471270Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.1471546Z %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1471797Z %6 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.1472006Z %7 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1472221Z %8 = arith.addi %6, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:14.1472421Z %9 = arith.addi %7, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:14.1472582Z %10 = arith.muli %2, %c16_i32 : i32 2026-02-21T08:55:14.1472779Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:55:14.1473050Z %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.1473333Z %13 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T08:55:14.1473542Z %14 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.1473746Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:55:14.1473951Z %16 = arith.addi %14, %12 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:14.1474185Z %17 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1474452Z %18 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:14.1474784Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T08:55:14.1475033Z %20 = arith.muli %19, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T08:55:14.1475225Z %21 = tt.broadcast %20 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:55:14.1475436Z %22 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:55:14.1475699Z %23 = tt.expand_dims %9 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T08:55:14.1475978Z %24 = tt.broadcast %23 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T08:55:14.1476186Z %25 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1476457Z %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:55:14.1476912Z %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:55:14.1477313Z %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, 
#blocked> 2026-02-21T08:55:14.1477563Z %29 = arith.cmpi eq, %28, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.1477757Z %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T08:55:14.1477954Z %31 = arith.cmpi eq, %28, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:14.1478139Z %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T08:55:14.1478406Z %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_2) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T08:55:14.1478670Z %43 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1478894Z %44 = arith.addi %43, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1479064Z %45 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:55:14.1479226Z %46 = tt.splat %45 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:14.1479435Z %47 = arith.addi %46, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:14.1479704Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:55:14.1479970Z %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:55:14.1480153Z %50 = arith.addi %21, %49 : tensor<16x4xi32, #blocked2> 2026-02-21T08:55:14.1480346Z %51 = tt.addptr %22, %50 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:55:14.1480545Z %52 = tt.load %51 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:55:14.1480805Z %53 = ttg.convert_layout %52 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1481235Z %54 = arith.extf %53 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, 
kWidth = 2}>> 2026-02-21T08:55:14.1481607Z %55 = tt.expand_dims %44 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:55:14.1481844Z %56 = arith.muli %55, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T08:55:14.1482029Z %57 = tt.broadcast %56 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T08:55:14.1482214Z %58 = arith.addi %57, %24 : tensor<2x256xi32, #blocked1> 2026-02-21T08:55:14.1482407Z %59 = tt.addptr %25, %58 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T08:55:14.1482669Z %60 = tt.load %59 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1482911Z %61 = ttg.convert_layout %60 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1483189Z %62 = arith.shli %61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1483417Z %63 = arith.shrsi %62, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1483646Z %64 = arith.shrsi %61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1483927Z %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:55:14.1484260Z %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:55:14.1484581Z %67 = tt.broadcast %65 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1484815Z %68 = arith.select %30, %67, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1485054Z %69 = tt.broadcast %66 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1485278Z %70 = arith.select %32, %69, %68 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 
2026-02-21T08:55:14.1485508Z %71 = tt.reshape %70 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T08:55:14.1485725Z %72 = arith.sitofp %71 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T08:55:14.1485969Z %73 = ttg.local_alloc %72 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T08:55:14.1486299Z %74 = ttg.local_load %73 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1486775Z %75 = tt.dot %54, %74, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T08:55:14.1487129Z %76 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T08:55:14.1487297Z %77 = tt.splat %76 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1487511Z %78 = arith.addi %77, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:14.1487678Z %79 = arith.muli %76, %c2_i32 : i32 2026-02-21T08:55:14.1487839Z %80 = tt.splat %79 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:14.1488050Z %81 = arith.addi %80, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:14.1488317Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:55:14.1488587Z %83 = tt.broadcast %82 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:55:14.1488772Z %84 = arith.addi %21, %83 : tensor<16x4xi32, #blocked2> 2026-02-21T08:55:14.1489002Z %85 = tt.addptr %22, %84 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:55:14.1489199Z %86 = tt.load %85 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:55:14.1489457Z %87 = ttg.convert_layout %86 : 
tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1489851Z %88 = arith.extf %87 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1490225Z %89 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:55:14.1490463Z %90 = arith.muli %89, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T08:55:14.1490648Z %91 = tt.broadcast %90 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T08:55:14.1490837Z %92 = arith.addi %91, %24 : tensor<2x256xi32, #blocked1> 2026-02-21T08:55:14.1491024Z %93 = tt.addptr %25, %92 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T08:55:14.1491218Z %94 = tt.load %93 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T08:55:14.1491450Z %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1491730Z %96 = arith.shli %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1491962Z %97 = arith.shrsi %96, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1492187Z %98 = arith.shrsi %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:14.1492502Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:55:14.1492832Z %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:55:14.1493114Z %101 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1493354Z %102 = arith.select %30, %101, %cst_5 : tensor<2x2x256xi1, #blocked>, 
tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1493594Z %103 = tt.broadcast %100 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1493828Z %104 = arith.select %32, %103, %102 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T08:55:14.1494057Z %105 = tt.reshape %104 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T08:55:14.1494280Z %106 = arith.sitofp %105 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T08:55:14.1494531Z %107 = ttg.local_alloc %106 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T08:55:14.1494852Z %108 = ttg.local_load %107 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:14.1495326Z %109 = tt.dot %88, %108, %75, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T08:55:14.1495671Z scf.yield %109 : tensor<16x256xf32, #mma> 2026-02-21T08:55:14.1495794Z } {tt.num_stages = 1 : i32} 2026-02-21T08:55:14.1495945Z %34 = arith.truncf %33 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T08:55:14.1496199Z %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T08:55:14.1496430Z %36 = arith.muli %35, %cst : tensor<16x1xi32, #mma> 2026-02-21T08:55:14.1496655Z %37 = tt.expand_dims %8 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T08:55:14.1496947Z %38 = tt.broadcast %36 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T08:55:14.1497142Z %39 = tt.broadcast %37 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T08:55:14.1497312Z %40 = arith.addi %38, %39 : tensor<16x256xi32, #mma> 
2026-02-21T08:55:14.1497479Z %41 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T08:55:14.1497683Z %42 = tt.addptr %41, %40 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T08:55:14.1497868Z tt.store %42, %34 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T08:55:14.1497992Z tt.return 2026-02-21T08:55:14.1498069Z } 2026-02-21T08:55:14.1498138Z } 2026-02-21T08:55:14.1498179Z 2026-02-21T08:55:14.1498207Z {-# 2026-02-21T08:55:14.1498285Z external_resources: { 2026-02-21T08:55:14.1498381Z mlir_reproducer: { 2026-02-21T08:55:14.1499378Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:55:14.1500378Z disable_threading: false, 2026-02-21T08:55:14.1500479Z verify_each: true 2026-02-21T08:55:14.1500566Z } 2026-02-21T08:55:14.1500635Z } 2026-02-21T08:55:14.1500701Z #-} 2026-02-21T08:55:14.1501011Z /tmp/torchinductor_root/5m/c5m2kcg5c6cotizhcpyzqmazzriuywnv35vebvsgepalqimlyqxy.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:55:14.1501696Z /tmp/torchinductor_root/5m/c5m2kcg5c6cotizhcpyzqmazzriuywnv35vebvsgepalqimlyqxy.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:55:14.1502241Z [38s] Triton compile failed. This likely indicates a bug in Triton. 
Skipping failing config. 2026-02-21T08:55:14.1502959Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T08:55:14.1503609Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:55:14.1503774Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:55:15.3996627Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:55:15.3999318Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T08:55:15.4000212Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:55:15.4000984Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T08:55:15.4001663Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:55:15.4002293Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:55:15.4002932Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:55:15.4003655Z #smem = #ttg.shared_memory 2026-02-21T08:55:15.4004219Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:55:15.4005369Z tt.func public 
@_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:55:15.4006394Z %cst = arith.constant dense<4096> : tensor<512x1xi64, #mma> 2026-02-21T08:55:15.4006812Z %cst_0 = arith.constant dense<0> : tensor<512x1xi64, #mma> 2026-02-21T08:55:15.4007210Z %cst_1 = arith.constant dense<8192> : tensor<512x1xi64, #mma> 2026-02-21T08:55:15.4007619Z %cst_2 = arith.constant dense<8192> : tensor<1x256xi64, #mma> 2026-02-21T08:55:15.4008007Z %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T08:55:15.4008421Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:15.4008835Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:15.4009270Z %cst_6 = arith.constant dense<1024> : tensor<512x1xi32, #blocked1> 2026-02-21T08:55:15.4009637Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:55:15.4010004Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<512x256xf32, #mma> 2026-02-21T08:55:15.4010328Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:55:15.4010518Z %c512_i64 = arith.constant 512 : i64 2026-02-21T08:55:15.4010710Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:55:15.4010899Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:55:15.4011087Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:55:15.4011272Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:55:15.4011657Z %cst_8 = arith.constant dense<0> : tensor<1x256xi8, #blocked2> 2026-02-21T08:55:15.4011947Z %cst_9 = arith.constant dense<8192> : tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4012242Z %cst_10 = arith.constant dense<0> : tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4012524Z %cst_11 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4012758Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:55:15.4012948Z %c1_i32 = arith.constant 1 : i32 
2026-02-21T08:55:15.4013122Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:55:15.4013309Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T08:55:15.4013611Z %cst_12 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4013923Z %0 = tt.get_program_id x : i32 2026-02-21T08:55:15.4014253Z %1 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:15.4014718Z %2 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:15.4015164Z %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:15.4015567Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<512x2x!tt.ptr, #blocked1> 2026-02-21T08:55:15.4015888Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:55:15.4016284Z %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:15.4016815Z %7 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:15.4017414Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:55:15.4018108Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:55:15.4018822Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:15.4019241Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:15.4019566Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 
2026-02-21T08:55:15.4019883Z %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:55:15.4020213Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T08:55:15.4020507Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<512x256x!tt.ptr, #mma> 2026-02-21T08:55:15.4020847Z %16 = arith.extsi %2 : tensor<512xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:15.4021221Z %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:15.4021608Z %18 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:15.4021921Z scf.for %arg3 = %0 to %c256_i32 step %c4864_i32 : i32 { 2026-02-21T08:55:15.4022097Z %19 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T08:55:15.4022250Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T08:55:15.4022395Z %21 = arith.subi %c8_i32, %20 : i32 2026-02-21T08:55:15.4022540Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T08:55:15.4022692Z %23 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T08:55:15.4022835Z %24 = arith.remsi %23, %22 : i32 2026-02-21T08:55:15.4022971Z %25 = arith.addi %20, %24 : i32 2026-02-21T08:55:15.4023110Z %26 = arith.divsi %23, %22 : i32 2026-02-21T08:55:15.4023252Z %27 = arith.muli %25, %c512_i32 : i32 2026-02-21T08:55:15.4023517Z %28 = tt.splat %27 : i32 -> tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:15.4023799Z %29 = arith.addi %28, %1 : tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:55:15.4024019Z %30 = arith.muli %26, %c256_i32 : i32 2026-02-21T08:55:15.4024295Z %31 = tt.expand_dims %29 {axis = 1 : i32} : tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<512x1xi32, #blocked1> 2026-02-21T08:55:15.4024617Z %32 = arith.muli %31, %cst_6 : tensor<512x1xi32, #blocked1> 
2026-02-21T08:55:15.4024854Z %33 = tt.broadcast %32 : tensor<512x1xi32, #blocked1> -> tensor<512x2xi32, #blocked1> 2026-02-21T08:55:15.4025071Z %34 = arith.extsi %30 : i32 to i64 2026-02-21T08:55:15.4025276Z %35 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:15.4025554Z %36 = arith.addi %35, %7 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:55:15.4025891Z %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4026218Z %38 = arith.cmpi sge, %37, %cst_10 : tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4026431Z %39 = arith.cmpi slt, %37, %cst_9 : tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4026635Z %40 = arith.andi %38, %39 : tensor<1x256xi1, #blocked2> 2026-02-21T08:55:15.4026921Z %41 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_7) -> (tensor<512x256xf32, #mma>) : i32 { 2026-02-21T08:55:15.4027199Z %64 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:55:15.4027405Z %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:15.4027671Z %66 = arith.addi %65, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:15.4028012Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:55:15.4028345Z %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<512x2xi32, #blocked1> 2026-02-21T08:55:15.4028627Z %69 = arith.addi %33, %68 : tensor<512x2xi32, #blocked1> 2026-02-21T08:55:15.4028865Z %70 = tt.addptr %4, %69 : tensor<512x2x!tt.ptr, #blocked1>, tensor<512x2xi32, #blocked1> 2026-02-21T08:55:15.4029118Z %71 = tt.load %70 : tensor<512x2x!tt.ptr, #blocked1> 2026-02-21T08:55:15.4029396Z %72 = ttg.local_alloc %71 : (tensor<512x2xbf16, #blocked1>) -> !ttg.memdesc<512x2xbf16, #shared, #smem> 
2026-02-21T08:55:15.4029807Z %73 = ttg.local_load %72 : !ttg.memdesc<512x2xbf16, #shared, #smem> -> tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:15.4030328Z %74 = arith.extf %73 : tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:15.4030616Z %75 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:55:15.4030747Z %76 = arith.muli %75, %c8192_i64 : i64 2026-02-21T08:55:15.4030892Z %77 = tt.splat %76 : i64 -> tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4031048Z %78 = arith.addi %77, %37 : tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4031243Z %79 = tt.addptr %5, %78 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4031451Z %80 = tt.load %79, %40, %cst_8 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:55:15.4031709Z %81 = ttg.convert_layout %80 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4031992Z %82 = arith.shli %81, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4032226Z %83 = arith.shrsi %82, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4032499Z %84 = arith.shrsi %81, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4032789Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:55:15.4033128Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:55:15.4033412Z %87 = tt.broadcast %85 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4033650Z %88 = arith.select %12, %87, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4033885Z 
%89 = tt.broadcast %86 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4034112Z %90 = arith.select %14, %89, %88 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4034343Z %91 = tt.reshape %90 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T08:55:15.4034566Z %92 = arith.sitofp %91 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T08:55:15.4034816Z %93 = ttg.local_alloc %92 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared1, #smem> 2026-02-21T08:55:15.4035141Z %94 = ttg.local_load %93 : !ttg.memdesc<2x256xf32, #shared1, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:15.4035617Z %95 = tt.dot %74, %94, %arg5, inputPrecision = tf32 : tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<512x256xf32, #mma> 2026-02-21T08:55:15.4035965Z %96 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:55:15.4036089Z %97 = arith.muli %96, %c2_i32 : i32 2026-02-21T08:55:15.4036256Z %98 = tt.splat %97 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:15.4036477Z %99 = arith.addi %98, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:55:15.4036788Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:55:15.4037065Z %101 = tt.broadcast %100 : tensor<1x2xi32, #blocked1> -> tensor<512x2xi32, #blocked1> 2026-02-21T08:55:15.4037265Z %102 = arith.addi %33, %101 : tensor<512x2xi32, #blocked1> 2026-02-21T08:55:15.4037469Z %103 = tt.addptr %4, %102 : tensor<512x2x!tt.ptr, #blocked1>, tensor<512x2xi32, #blocked1> 2026-02-21T08:55:15.4037678Z %104 = tt.load %103 : tensor<512x2x!tt.ptr, #blocked1> 2026-02-21T08:55:15.4037905Z %105 = ttg.local_alloc %104 : (tensor<512x2xbf16, 
#blocked1>) -> !ttg.memdesc<512x2xbf16, #shared, #smem> 2026-02-21T08:55:15.4038241Z %106 = ttg.local_load %105 : !ttg.memdesc<512x2xbf16, #shared, #smem> -> tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:15.4038653Z %107 = arith.extf %106 : tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:15.4038940Z %108 = arith.extsi %96 : i32 to i64 2026-02-21T08:55:15.4039065Z %109 = arith.muli %108, %c8192_i64 : i64 2026-02-21T08:55:15.4039210Z %110 = tt.splat %109 : i64 -> tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4039373Z %111 = arith.addi %110, %37 : tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4039575Z %112 = tt.addptr %5, %111 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi64, #blocked2> 2026-02-21T08:55:15.4039767Z %113 = arith.cmpi slt, %108, %c512_i64 : i64 2026-02-21T08:55:15.4039914Z %114 = tt.splat %113 : i1 -> tensor<1x256xi1, #blocked2> 2026-02-21T08:55:15.4040071Z %115 = arith.andi %114, %40 : tensor<1x256xi1, #blocked2> 2026-02-21T08:55:15.4040294Z %116 = tt.load %112, %115, %cst_8 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T08:55:15.4040561Z %117 = ttg.convert_layout %116 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4040842Z %118 = arith.shli %117, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4041079Z %119 = arith.shrsi %118, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4041314Z %120 = arith.shrsi %117, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:55:15.4041606Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:55:15.4041946Z %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim 
= 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T08:55:15.4042235Z %123 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4042480Z %124 = arith.select %12, %123, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4042769Z %125 = tt.broadcast %122 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4043004Z %126 = arith.select %14, %125, %124 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T08:55:15.4043235Z %127 = tt.reshape %126 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T08:55:15.4043460Z %128 = arith.sitofp %127 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T08:55:15.4043718Z %129 = ttg.local_alloc %128 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared1, #smem> 2026-02-21T08:55:15.4044044Z %130 = ttg.local_load %129 : !ttg.memdesc<2x256xf32, #shared1, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:55:15.4044518Z %131 = tt.dot %107, %130, %95, inputPrecision = tf32 : tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<512x256xf32, #mma> 2026-02-21T08:55:15.4044909Z scf.yield %131 : tensor<512x256xf32, #mma> 2026-02-21T08:55:15.4045059Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:55:15.4045247Z %42 = arith.truncf %41 : tensor<512x256xf32, #mma> to tensor<512x256xbf16, #mma> 2026-02-21T08:55:15.4045414Z %43 = arith.extsi %27 : i32 to i64 2026-02-21T08:55:15.4045581Z %44 = tt.splat %43 : i64 -> tensor<512xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:15.4045789Z %45 = arith.addi %44, %16 : tensor<512xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:55:15.4046052Z %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<512xi64, 
#ttg.slice<{dim = 1, parent = #mma}>> -> tensor<512x1xi64, #mma> 2026-02-21T08:55:15.4046287Z %47 = arith.muli %46, %cst_1 : tensor<512x1xi64, #mma> 2026-02-21T08:55:15.4046464Z %48 = tt.broadcast %47 : tensor<512x1xi64, #mma> -> tensor<512x256xi64, #mma> 2026-02-21T08:55:15.4046667Z %49 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:15.4046874Z %50 = arith.addi %49, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:55:15.4047136Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:55:15.4047395Z %52 = tt.broadcast %51 : tensor<1x256xi64, #mma> -> tensor<512x256xi64, #mma> 2026-02-21T08:55:15.4047573Z %53 = arith.addi %48, %52 : tensor<512x256xi64, #mma> 2026-02-21T08:55:15.4047760Z %54 = tt.addptr %15, %53 : tensor<512x256x!tt.ptr, #mma>, tensor<512x256xi64, #mma> 2026-02-21T08:55:15.4048000Z %55 = arith.cmpi sge, %46, %cst_0 : tensor<512x1xi64, #mma> 2026-02-21T08:55:15.4048160Z %56 = arith.cmpi slt, %46, %cst : tensor<512x1xi64, #mma> 2026-02-21T08:55:15.4048312Z %57 = arith.andi %55, %56 : tensor<512x1xi1, #mma> 2026-02-21T08:55:15.4048477Z %58 = tt.broadcast %57 : tensor<512x1xi1, #mma> -> tensor<512x256xi1, #mma> 2026-02-21T08:55:15.4048658Z %59 = arith.cmpi sge, %51, %cst_3 : tensor<1x256xi64, #mma> 2026-02-21T08:55:15.4048815Z %60 = arith.cmpi slt, %51, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:55:15.4048966Z %61 = arith.andi %59, %60 : tensor<1x256xi1, #mma> 2026-02-21T08:55:15.4049132Z %62 = tt.broadcast %61 : tensor<1x256xi1, #mma> -> tensor<512x256xi1, #mma> 2026-02-21T08:55:15.4049302Z %63 = arith.andi %58, %62 : tensor<512x256xi1, #mma> 2026-02-21T08:55:15.4049458Z tt.store %54, %42, %63 : tensor<512x256x!tt.ptr, #mma> 2026-02-21T08:55:15.4049663Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T08:55:15.4049848Z tt.return 
2026-02-21T08:55:15.4049927Z } 2026-02-21T08:55:15.4050004Z } 2026-02-21T08:55:15.4050050Z 2026-02-21T08:55:15.4050083Z {-# 2026-02-21T08:55:15.4050163Z external_resources: { 2026-02-21T08:55:15.4050265Z mlir_reproducer: { 2026-02-21T08:55:15.4051253Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:55:15.4052246Z disable_threading: false, 2026-02-21T08:55:15.4052357Z verify_each: true 2026-02-21T08:55:15.4052445Z } 2026-02-21T08:55:15.4052518Z } 2026-02-21T08:55:15.4052585Z #-} 2026-02-21T08:55:15.4052898Z /tmp/torchinductor_root/h5/ch546cwt6jmz5jakx6zpb6tj35nv6hr6ogcmrgeettbjvatzf4ob.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:55:15.4053571Z /tmp/torchinductor_root/h5/ch546cwt6jmz5jakx6zpb6tj35nv6hr6ogcmrgeettbjvatzf4ob.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:55:15.4054115Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:55:15.4054900Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 512, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[2, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T08:55:15.4055616Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:55:15.4055781Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:55:16.3142655Z Initial population exploring neighbors 12% ━╸ 12/100 3.3 configs/s 2026-02-21T08:55:16.3145262Z WARNING:tritonbench.utils.triton_op:Completed input ID 17: 2026-02-21T08:55:16.3145681Z x_val 2026-02-21T08:55:16.3145920Z --------------------- 2026-02-21T08:55:16.3146181Z (1, 4096, 8192, 1024) 2026-02-21T08:55:16.3146337Z 2026-02-21T08:55:16.3161607Z 60%|██████ | 6/10 [45:55<24:07, 361.86s/it]WARNING:tritonbench.utils.triton_op:Running input ID 21: 2026-02-21T08:55:16.3161909Z x_val 2026-02-21T08:55:16.3162036Z --------------------- 2026-02-21T08:55:16.3162790Z (4, 4096, 8192, 1024) 2026-02-21T08:55:16.3165170Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:55:17.3446484Z INFO:tritonbench.utils.triton_op:Took 4.36ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:55:19.3304995Z Autotune Choices Stats: 2026-02-21T08:55:19.3306667Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "triton_mm_155", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.4005120098590851, 
"best_triton_pos": 0} 2026-02-21T08:55:19.3313854Z AUTOTUNE mm(16384x1024, 1024x8192) 2026-02-21T08:55:19.3315691Z strides: [1024, 1], [8192, 1] 2026-02-21T08:55:19.3316162Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:55:19.3317262Z triton_mm_155 0.4005 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:55:19.3319039Z triton_mm_152 0.4050 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:55:19.3320550Z triton_mm_149 0.4173 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:55:19.3321448Z mm 0.4410 ms 90.8% 2026-02-21T08:55:19.3322302Z triton_mm_143 0.4791 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:55:19.3323760Z triton_mm_150 0.4793 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:55:19.3325451Z triton_mm_146 0.4929 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:55:19.3326654Z triton_mm_154 0.4969 ms 80.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:55:19.3327853Z triton_mm_153 0.5374 ms 74.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:55:19.3329066Z triton_mm_148 0.5606 ms 71.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:55:19.3329999Z SingleProcess AUTOTUNE benchmarking takes 1.4528 seconds and 0.3548 seconds precompiling for 37 choices 2026-02-21T08:55:20.7390495Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:55:20.7401424Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:55:20.7401831Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:55:20.7402192Z 'dtype': 'torch.bfloat16', 2026-02-21T08:55:20.7402521Z 'shape': (4, 4096, 1024), 2026-02-21T08:55:20.7402939Z 'stride': (4194304, 1024, 1)}, 2026-02-21T08:55:20.7403219Z { 'device': 'cuda:0', 2026-02-21T08:55:20.7403464Z 'dtype': 'torch.int32', 2026-02-21T08:55:20.7404220Z 'shape': (1024, 8192), 2026-02-21T08:55:20.7404469Z 'stride': (8192, 1)}), 2026-02-21T08:55:20.7404713Z 'kwargs': {}} 2026-02-21T08:55:20.7417853Z INFO:tritonbench.utils.triton_op:Took 1.87ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:55:20.9163230Z [0s] Autotune random seed: 2134834638 2026-02-21T08:55:21.0209622Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:55:58.8921394Z [37s] Timeout after 30s compiling Config(block_sizes=[128, 2, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], 
loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:56:02.9408338Z [41s] Timeout after 30s compiling Config(block_sizes=[32, 2048, 1], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:56:02.9428940Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s 2026-02-21T08:56:07.0499481Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:56:07.0508226Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:56:07.0511274Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:56:07.0511835Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T08:56:07.0512804Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T08:56:07.0513329Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:56:07.0514092Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:56:07.0514755Z %cst = arith.constant dense<0.000000e+00> : tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0515025Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:56:07.0515225Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:56:07.0515417Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:56:07.0515683Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:56:07.0515935Z %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0516191Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:56:07.0516391Z %c510_i32 = arith.constant 510 : i32 2026-02-21T08:56:07.0516556Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:56:07.0516740Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:56:07.0516917Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:56:07.0517097Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:56:07.0517371Z %cst_1 = arith.constant dense<0> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0517759Z %cst_2 = arith.constant dense<8192> : 
tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0518131Z %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0518638Z %cst_4 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T08:56:07.0518941Z %cst_5 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0519258Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.0519506Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.0519757Z %cst_8 = arith.constant dense<8192> : tensor<32x1xi32, #mma> 2026-02-21T08:56:07.0519972Z %0 = tt.get_program_id x : i32 2026-02-21T08:56:07.0520129Z %1 = arith.muli %0, %c4_i32 : i32 2026-02-21T08:56:07.0520293Z %2 = arith.addi %1, %c4_i32 : i32 2026-02-21T08:56:07.0520464Z %3 = arith.minsi %2, %c131072_i32 : i32 2026-02-21T08:56:07.0520758Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.0521159Z %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.0521654Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:07.0522102Z %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.0522486Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0522944Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0523297Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0523873Z %11 = arith.extsi %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, 
parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:07.0524507Z %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:56:07.0525160Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:56:07.0525749Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.0526118Z %15 = arith.cmpi eq, %14, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.0526390Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T08:56:07.0526692Z %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.0526901Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T08:56:07.0527143Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<32x32x!tt.ptr, #mma> 2026-02-21T08:56:07.0527324Z %20 = arith.subi %3, %1 : i32 2026-02-21T08:56:07.0527453Z %21 = arith.remsi %20, %c2_i32 : i32 2026-02-21T08:56:07.0527598Z %22 = arith.subi %20, %21 : i32 2026-02-21T08:56:07.0527723Z %23 = arith.addi %1, %22 : i32 2026-02-21T08:56:07.0527868Z scf.for %arg3 = %1 to %23 step %c2_i32 : i32 { 2026-02-21T08:56:07.0528023Z %24 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T08:56:07.0528169Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T08:56:07.0528307Z %26 = arith.muli %24, %c32_i32 : i32 2026-02-21T08:56:07.0528499Z %27 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.0528748Z %28 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.0528987Z %29 = arith.addi %27, %4 : tensor<32xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.0529274Z %30 = arith.addi %28, %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.0529465Z %31 = arith.muli %25, %c32_i32 : i32 2026-02-21T08:56:07.0529649Z %32 = tt.splat %31 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.0529877Z %33 = arith.addi %32, %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.0530174Z %34 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:56:07.0530465Z %35 = arith.muli %34, %cst_4 : tensor<32x1xi32, #blocked1> 2026-02-21T08:56:07.0530680Z %36 = tt.broadcast %35 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0530884Z %37 = arith.extsi %31 : i32 to i64 2026-02-21T08:56:07.0531117Z %38 = tt.splat %37 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:07.0531459Z %39 = arith.addi %38, %11 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:07.0531901Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0532313Z %41 = arith.cmpi sge, %40, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0532587Z %42 = arith.cmpi slt, %40, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0532857Z %43 = arith.andi %41, %42 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0533152Z %44 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T08:56:07.0533400Z %85 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0533594Z %86 = tt.splat 
%85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0533852Z %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0534206Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0534519Z %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0534740Z %90 = arith.addi %36, %89 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0534967Z %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0535198Z %92 = tt.load %91 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0535507Z %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0535965Z %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0536285Z %95 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:07.0536428Z %96 = arith.muli %95, %c8192_i64 : i64 2026-02-21T08:56:07.0536609Z %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0536838Z %98 = arith.addi %97, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0537146Z %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0537474Z %100 = tt.load %99, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0537726Z %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0538012Z %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T08:56:07.0538255Z %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0538553Z %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0538894Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0539181Z %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0539427Z %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0539672Z %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0539906Z %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0540146Z %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0540371Z %111 = arith.sitofp %110 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0540671Z %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0541147Z %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0541498Z %114 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:56:07.0541629Z %115 = arith.muli %114, %c2_i32 : i32 2026-02-21T08:56:07.0541806Z %116 = tt.splat %115 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0542035Z %117 = arith.addi %116, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0542316Z %118 = 
tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0542637Z %119 = tt.broadcast %118 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0542839Z %120 = arith.addi %36, %119 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0543042Z %121 = tt.addptr %9, %120 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0543257Z %122 = tt.load %121 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0543532Z %123 = ttg.convert_layout %122 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0543941Z %124 = arith.extf %123 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0544229Z %125 = arith.extsi %114 : i32 to i64 2026-02-21T08:56:07.0544356Z %126 = arith.muli %125, %c8192_i64 : i64 2026-02-21T08:56:07.0544542Z %127 = tt.splat %126 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0544782Z %128 = arith.addi %127, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0545098Z %129 = tt.addptr %10, %128 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0545432Z %130 = tt.load %129, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0545685Z %131 = arith.shli %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0545923Z %132 = arith.shrsi %131, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0546220Z %133 = arith.shrsi %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0546518Z %134 = tt.expand_dims %132 {axis = 1 
: i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0546859Z %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0547147Z %136 = tt.broadcast %134 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0547387Z %137 = arith.select %16, %136, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0547627Z %138 = tt.broadcast %135 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0547860Z %139 = arith.select %18, %138, %137 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0548099Z %140 = tt.reshape %139 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0548325Z %141 = arith.sitofp %140 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0548626Z %142 = ttg.convert_layout %141 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0549094Z %143 = tt.dot %124, %142, %113, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0549441Z %144 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0549572Z %145 = arith.muli %144, %c2_i32 : i32 2026-02-21T08:56:07.0549748Z %146 = tt.splat %145 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0549974Z %147 = arith.addi %146, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0550259Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0550576Z %149 = tt.broadcast %148 : tensor<1x2xi32, #blocked1> -> 
tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0550780Z %150 = arith.addi %36, %149 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0550986Z %151 = tt.addptr %9, %150 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0551191Z %152 = tt.load %151 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0551463Z %153 = ttg.convert_layout %152 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0551864Z %154 = arith.extf %153 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0552153Z %155 = arith.extsi %144 : i32 to i64 2026-02-21T08:56:07.0552283Z %156 = arith.muli %155, %c8192_i64 : i64 2026-02-21T08:56:07.0552464Z %157 = tt.splat %156 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0552701Z %158 = arith.addi %157, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0553013Z %159 = tt.addptr %10, %158 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0553341Z %160 = tt.load %159, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0553593Z %161 = arith.shli %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0553830Z %162 = arith.shrsi %161, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0554101Z %163 = arith.shrsi %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0554392Z %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0554735Z %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0555029Z %166 = tt.broadcast %164 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0555268Z %167 = arith.select %16, %166, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0555512Z %168 = tt.broadcast %165 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0555744Z %169 = arith.select %18, %168, %167 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0555983Z %170 = tt.reshape %169 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0556209Z %171 = arith.sitofp %170 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0556506Z %172 = ttg.convert_layout %171 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0556972Z %173 = tt.dot %154, %172, %143, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0557323Z scf.yield %173 : tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0557456Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:56:07.0557676Z %45 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %44) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T08:56:07.0557895Z %85 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0558075Z %86 = tt.splat %85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0558297Z %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0558610Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0558888Z %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> 
tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0559081Z %90 = arith.addi %36, %89 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0559282Z %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0559484Z %92 = tt.load %91 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0559750Z %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0560153Z %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0560438Z %95 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:07.0560564Z %96 = arith.muli %95, %c8192_i64 : i64 2026-02-21T08:56:07.0560734Z %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0560964Z %98 = arith.addi %97, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0561271Z %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0561592Z %100 = tt.load %99, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0561842Z %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0562117Z %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0562392Z %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0562732Z %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0563072Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
-> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0563361Z %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0563607Z %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0563846Z %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0564085Z %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0564318Z %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0564548Z %111 = arith.sitofp %110 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0564848Z %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0565314Z %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0565669Z scf.yield %113 : tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0565821Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:56:07.0566013Z %46 = arith.truncf %45 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T08:56:07.0566280Z %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma> 2026-02-21T08:56:07.0566558Z %48 = arith.muli %47, %cst_8 : tensor<32x1xi32, #mma> 2026-02-21T08:56:07.0566790Z %49 = tt.expand_dims %33 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T08:56:07.0567045Z %50 = tt.broadcast %48 : tensor<32x1xi32, #mma> -> tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0567246Z %51 = tt.broadcast %49 : tensor<1x32xi32, #mma> -> 
tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0567426Z %52 = arith.addi %50, %51 : tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0567609Z %53 = tt.addptr %19, %52 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0567802Z tt.store %53, %46 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T08:56:07.0567941Z %54 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T08:56:07.0568070Z %55 = arith.remsi %54, %c512_i32 : i32 2026-02-21T08:56:07.0568191Z %56 = arith.divsi %54, %c512_i32 : i32 2026-02-21T08:56:07.0568314Z %57 = arith.muli %55, %c32_i32 : i32 2026-02-21T08:56:07.0568488Z %58 = tt.splat %57 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.0568701Z %59 = tt.splat %57 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.0568919Z %60 = arith.addi %58, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.0569129Z %61 = arith.addi %59, %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.0569300Z %62 = arith.muli %56, %c32_i32 : i32 2026-02-21T08:56:07.0569457Z %63 = tt.splat %62 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.0569663Z %64 = arith.addi %63, %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.0569977Z %65 = tt.expand_dims %60 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:56:07.0570227Z %66 = arith.muli %65, %cst_4 : tensor<32x1xi32, #blocked1> 2026-02-21T08:56:07.0570426Z %67 = tt.broadcast %66 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0570598Z %68 = arith.extsi %62 : i32 to i64 2026-02-21T08:56:07.0570807Z %69 = tt.splat %68 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:07.0571107Z %70 = arith.addi %69, %11 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = 
#blocked}>}>> 2026-02-21T08:56:07.0571493Z %71 = tt.expand_dims %70 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0571856Z %72 = arith.cmpi sge, %71, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0572098Z %73 = arith.cmpi slt, %71, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0594337Z %74 = arith.andi %72, %73 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0594634Z %75 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T08:56:07.0594858Z %85 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0595045Z %86 = tt.splat %85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0595279Z %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0595561Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0595849Z %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0596051Z %90 = arith.addi %67, %89 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0596257Z %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0625171Z %92 = tt.load %91 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0625454Z %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0625873Z %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0626163Z %95 = 
arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:07.0626295Z %96 = arith.muli %95, %c8192_i64 : i64 2026-02-21T08:56:07.0626474Z %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0626715Z %98 = arith.addi %97, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0627026Z %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0627355Z %100 = tt.load %99, %74, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0627608Z %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0627845Z %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0628085Z %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0628384Z %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0628843Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0629129Z %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0629371Z %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0629608Z %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0629839Z %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0630066Z %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0630295Z %111 = arith.sitofp %110 : tensor<2x32xi8, 
#blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0630591Z %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0631067Z %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0631424Z %114 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:56:07.0631549Z %115 = arith.muli %114, %c2_i32 : i32 2026-02-21T08:56:07.0631724Z %116 = tt.splat %115 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0631951Z %117 = arith.addi %116, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0632222Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0632496Z %119 = tt.broadcast %118 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0632694Z %120 = arith.addi %67, %119 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0632902Z %121 = tt.addptr %9, %120 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0633161Z %122 = tt.load %121 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0633425Z %123 = ttg.convert_layout %122 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0633827Z %124 = arith.extf %123 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0634107Z %125 = arith.extsi %114 : i32 to i64 2026-02-21T08:56:07.0634233Z %126 = arith.muli %125, %c8192_i64 : i64 2026-02-21T08:56:07.0634409Z %127 = tt.splat %126 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T08:56:07.0634646Z %128 = arith.addi %127, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0634960Z %129 = tt.addptr %10, %128 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0635287Z %130 = tt.load %129, %74, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0635536Z %131 = arith.shli %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0635777Z %132 = arith.shrsi %131, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0636014Z %133 = arith.shrsi %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0636306Z %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0636675Z %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0636963Z %136 = tt.broadcast %134 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0637207Z %137 = arith.select %16, %136, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0637442Z %138 = tt.broadcast %135 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0637676Z %139 = arith.select %18, %138, %137 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0637907Z %140 = tt.reshape %139 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0638133Z %141 = arith.sitofp %140 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0638431Z %142 = ttg.convert_layout %141 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T08:56:07.0638895Z %143 = tt.dot %124, %142, %113, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0639245Z %144 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0639368Z %145 = arith.muli %144, %c2_i32 : i32 2026-02-21T08:56:07.0639543Z %146 = tt.splat %145 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0639771Z %147 = arith.addi %146, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0640045Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0640323Z %149 = tt.broadcast %148 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0640517Z %150 = arith.addi %67, %149 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0640723Z %151 = tt.addptr %9, %150 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0640968Z %152 = tt.load %151 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0641231Z %153 = ttg.convert_layout %152 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0641634Z %154 = arith.extf %153 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0641911Z %155 = arith.extsi %144 : i32 to i64 2026-02-21T08:56:07.0642040Z %156 = arith.muli %155, %c8192_i64 : i64 2026-02-21T08:56:07.0642218Z %157 = tt.splat %156 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0642443Z %158 = arith.addi %157, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0642825Z %159 = tt.addptr %10, %158 : tensor<1x32x!tt.ptr, 
#ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0643154Z %160 = tt.load %159, %74, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0643401Z %161 = arith.shli %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0643640Z %162 = arith.shrsi %161, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0643879Z %163 = arith.shrsi %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0644171Z %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0644504Z %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0644866Z %166 = tt.broadcast %164 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0645105Z %167 = arith.select %16, %166, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0645344Z %168 = tt.broadcast %165 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0645573Z %169 = arith.select %18, %168, %167 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0645805Z %170 = tt.reshape %169 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0646028Z %171 = arith.sitofp %170 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0646323Z %172 = ttg.convert_layout %171 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0646787Z %173 = tt.dot %154, %172, %143, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = 
#mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0647136Z scf.yield %173 : tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0647266Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:56:07.0647479Z %76 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %75) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T08:56:07.0647693Z %85 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0647864Z %86 = tt.splat %85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0648081Z %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0648356Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0648637Z %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0648828Z %90 = arith.addi %67, %89 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0649067Z %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0649265Z %92 = tt.load %91 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0649532Z %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0649932Z %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0650212Z %95 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:07.0650339Z %96 = arith.muli %95, %c8192_i64 : i64 2026-02-21T08:56:07.0650507Z %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0650737Z %98 = arith.addi %97, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0651046Z %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, 
parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0651363Z %100 = tt.load %99, %74, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0651612Z %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0651849Z %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0652087Z %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0652379Z %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0652746Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0653036Z %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0653273Z %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0653509Z %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0653743Z %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0653973Z %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0654200Z %111 = arith.sitofp %110 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0654497Z %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0654961Z %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0655313Z scf.yield %113 : tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0655461Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:56:07.0655649Z %77 = arith.truncf %76 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T08:56:07.0655911Z %78 = tt.expand_dims %61 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma> 2026-02-21T08:56:07.0656145Z %79 = arith.muli %78, %cst_8 : tensor<32x1xi32, #mma> 2026-02-21T08:56:07.0656373Z %80 = tt.expand_dims %64 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T08:56:07.0656625Z %81 = tt.broadcast %79 : tensor<32x1xi32, #mma> -> tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0656824Z %82 = tt.broadcast %80 : tensor<1x32xi32, #mma> -> tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0657036Z %83 = arith.addi %81, %82 : tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0657219Z %84 = tt.addptr %19, %83 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0657412Z tt.store %84, %77 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T08:56:07.0657548Z } {tt.num_stages = 1 : i32} 2026-02-21T08:56:07.0657672Z scf.for %arg3 = %23 to %3 step %c1_i32 : i32 { 2026-02-21T08:56:07.0657806Z %24 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T08:56:07.0657933Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T08:56:07.0658054Z %26 = arith.muli %24, %c32_i32 : i32 2026-02-21T08:56:07.0658221Z %27 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.0658435Z %28 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.0658644Z %29 = arith.addi %27, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.0658859Z %30 = arith.addi %28, %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.0659021Z %31 = arith.muli %25, %c32_i32 : 
i32 2026-02-21T08:56:07.0659182Z %32 = tt.splat %31 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.0659385Z %33 = arith.addi %32, %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.0659649Z %34 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T08:56:07.0659902Z %35 = arith.muli %34, %cst_4 : tensor<32x1xi32, #blocked1> 2026-02-21T08:56:07.0660090Z %36 = tt.broadcast %35 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0660323Z %37 = arith.extsi %31 : i32 to i64 2026-02-21T08:56:07.0660526Z %38 = tt.splat %37 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:07.0660823Z %39 = arith.addi %38, %11 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:07.0661208Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0661565Z %41 = arith.cmpi sge, %40, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0661808Z %42 = arith.cmpi slt, %40, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0662041Z %43 = arith.andi %41, %42 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0662302Z %44 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T08:56:07.0662516Z %54 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0662686Z %55 = tt.splat %54 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0662910Z %56 = arith.addi %55, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0663183Z 
%57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0663453Z %58 = tt.broadcast %57 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0663644Z %59 = arith.addi %36, %58 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0663838Z %60 = tt.addptr %9, %59 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0664040Z %61 = tt.load %60 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0664305Z %62 = ttg.convert_layout %61 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0664732Z %63 = arith.extf %62 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0665012Z %64 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:07.0665135Z %65 = arith.muli %64, %c8192_i64 : i64 2026-02-21T08:56:07.0665309Z %66 = tt.splat %65 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0665537Z %67 = arith.addi %66, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0665843Z %68 = tt.addptr %10, %67 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0666166Z %69 = tt.load %68, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0666407Z %70 = arith.shli %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0666648Z %71 = arith.shrsi %70, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0666887Z %72 = arith.shrsi %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0667171Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0667508Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0667784Z %75 = tt.broadcast %73 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0668021Z %76 = arith.select %16, %75, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0668311Z %77 = tt.broadcast %74 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0668537Z %78 = arith.select %18, %77, %76 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0668764Z %79 = tt.reshape %78 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0668981Z %80 = arith.sitofp %79 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0669277Z %81 = ttg.convert_layout %80 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0669743Z %82 = tt.dot %63, %81, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0670085Z %83 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:56:07.0670219Z %84 = arith.muli %83, %c2_i32 : i32 2026-02-21T08:56:07.0670385Z %85 = tt.splat %84 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0670607Z %86 = arith.addi %85, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0670887Z %87 = tt.expand_dims %86 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0671158Z %88 = tt.broadcast %87 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 
2026-02-21T08:56:07.0671353Z %89 = arith.addi %36, %88 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0671548Z %90 = tt.addptr %9, %89 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0671755Z %91 = tt.load %90 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0672029Z %92 = ttg.convert_layout %91 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0672423Z %93 = arith.extf %92 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0672739Z %94 = arith.extsi %83 : i32 to i64 2026-02-21T08:56:07.0672860Z %95 = arith.muli %94, %c8192_i64 : i64 2026-02-21T08:56:07.0673036Z %96 = tt.splat %95 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0673263Z %97 = arith.addi %96, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0673569Z %98 = tt.addptr %10, %97 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0673893Z %99 = tt.load %98, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0674140Z %100 = arith.shli %99, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0674383Z %101 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0674632Z %102 = arith.shrsi %99, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0674925Z %103 = tt.expand_dims %101 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0675264Z %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 
2026-02-21T08:56:07.0675547Z %105 = tt.broadcast %103 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0675792Z %106 = arith.select %16, %105, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0676072Z %107 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0676306Z %108 = arith.select %18, %107, %106 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0676544Z %109 = tt.reshape %108 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0676768Z %110 = arith.sitofp %109 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0677070Z %111 = ttg.convert_layout %110 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0677529Z %112 = tt.dot %93, %111, %82, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0677871Z %113 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0678001Z %114 = arith.muli %113, %c2_i32 : i32 2026-02-21T08:56:07.0678179Z %115 = tt.splat %114 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0678404Z %116 = arith.addi %115, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0678686Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0678964Z %118 = tt.broadcast %117 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0679164Z %119 = arith.addi %36, %118 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0679364Z %120 = tt.addptr %9, %119 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0679576Z 
%121 = tt.load %120 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0679852Z %122 = ttg.convert_layout %121 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0680264Z %123 = arith.extf %122 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0680589Z %124 = arith.extsi %113 : i32 to i64 2026-02-21T08:56:07.0680718Z %125 = arith.muli %124, %c8192_i64 : i64 2026-02-21T08:56:07.0680897Z %126 = tt.splat %125 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0681133Z %127 = arith.addi %126, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0681445Z %128 = tt.addptr %10, %127 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0681775Z %129 = tt.load %128, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0682029Z %130 = arith.shli %129, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0682267Z %131 = arith.shrsi %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0682514Z %132 = arith.shrsi %129, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0682846Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0683185Z %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0683475Z %135 = tt.broadcast %133 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0683714Z %136 = arith.select %16, %135, %cst_0 : tensor<1x2x32xi1, #blocked>, 
tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0683955Z %137 = tt.broadcast %134 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0684237Z %138 = arith.select %18, %137, %136 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0684475Z %139 = tt.reshape %138 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0684703Z %140 = arith.sitofp %139 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0684999Z %141 = ttg.convert_layout %140 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0685462Z %142 = tt.dot %123, %141, %112, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0685808Z scf.yield %142 : tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0685941Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:56:07.0686157Z %45 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %44) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T08:56:07.0686372Z %54 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:07.0686546Z %55 = tt.splat %54 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0686764Z %56 = arith.addi %55, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.0687041Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:07.0687319Z %58 = tt.broadcast %57 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0687509Z %59 = arith.addi %36, %58 : tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0687709Z %60 = tt.addptr %9, %59 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T08:56:07.0687909Z %61 = tt.load 
%60 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T08:56:07.0688180Z %62 = ttg.convert_layout %61 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0688620Z %63 = arith.extf %62 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0688896Z %64 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:07.0689023Z %65 = arith.muli %64, %c8192_i64 : i64 2026-02-21T08:56:07.0689195Z %66 = tt.splat %65 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0689421Z %67 = arith.addi %66, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0689727Z %68 = tt.addptr %10, %67 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0690048Z %69 = tt.load %68, %43, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0690295Z %70 = arith.shli %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0690530Z %71 = arith.shrsi %70, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0690766Z %72 = arith.shrsi %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.0691052Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0691379Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T08:56:07.0691662Z %75 = tt.broadcast %73 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0691894Z %76 = arith.select %16, %75, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 
2026-02-21T08:56:07.0692169Z %77 = tt.broadcast %74 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0692402Z %78 = arith.select %18, %77, %76 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T08:56:07.0692625Z %79 = tt.reshape %78 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T08:56:07.0692846Z %80 = arith.sitofp %79 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T08:56:07.0693135Z %81 = ttg.convert_layout %80 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.0693594Z %82 = tt.dot %63, %81, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0693945Z scf.yield %82 : tensor<32x32xf32, #mma> 2026-02-21T08:56:07.0694094Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:56:07.0694282Z %46 = arith.truncf %45 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T08:56:07.0694542Z %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma> 2026-02-21T08:56:07.0694779Z %48 = arith.muli %47, %cst_8 : tensor<32x1xi32, #mma> 2026-02-21T08:56:07.0695010Z %49 = tt.expand_dims %33 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T08:56:07.0695260Z %50 = tt.broadcast %48 : tensor<32x1xi32, #mma> -> tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0695459Z %51 = tt.broadcast %49 : tensor<1x32xi32, #mma> -> tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0695631Z %52 = arith.addi %50, %51 : tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0695815Z %53 = tt.addptr %19, %52 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi32, #mma> 2026-02-21T08:56:07.0696010Z tt.store %53, %46 : tensor<32x32x!tt.ptr, #mma> 
2026-02-21T08:56:07.0696145Z } {tt.num_stages = 1 : i32}
2026-02-21T08:56:07.0696284Z tt.return
2026-02-21T08:56:07.0696367Z }
2026-02-21T08:56:07.0696453Z }
2026-02-21T08:56:07.0696497Z
2026-02-21T08:56:07.0696530Z {-#
2026-02-21T08:56:07.0696619Z external_resources: {
2026-02-21T08:56:07.0696720Z mlir_reproducer: {
2026-02-21T08:56:07.0697723Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:56:07.0698726Z disable_threading: false,
2026-02-21T08:56:07.0698838Z verify_each: true
2026-02-21T08:56:07.0698933Z }
2026-02-21T08:56:07.0699013Z }
2026-02-21T08:56:07.0699089Z #-}
2026-02-21T08:56:07.0699375Z /tmp/torchinductor_root/ab/cab4wikgpuaqbr5wve3okzh76jhlc7ysmubfwuocu4fyavuwbuse.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:56:07.0700098Z /tmp/torchinductor_root/ab/cab4wikgpuaqbr5wve3okzh76jhlc7ysmubfwuocu4fyavuwbuse.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:56:07.0700648Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:56:07.0701466Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 32], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T08:56:07.0702181Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:56:07.0702350Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:56:07.1074981Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:56:07.1079093Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:56:07.1079419Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:56:07.1079730Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:56:07.1080037Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:56:07.1080323Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:56:07.1080586Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:56:07.1080770Z #smem = #ttg.shared_memory
2026-02-21T08:56:07.1081010Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:56:07.1081490Z 
tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:56:07.1081864Z %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma> 2026-02-21T08:56:07.1082155Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.1082330Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.1082519Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T08:56:07.1082790Z %cst_3 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:07.1082972Z %cst_4 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2> 2026-02-21T08:56:07.1083129Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:56:07.1083249Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:56:07.1083373Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:56:07.1083490Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:56:07.1083613Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:56:07.1083763Z %cst_5 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1083912Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:56:07.1084031Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:56:07.1084218Z %cst_6 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1084415Z %0 = tt.get_program_id x : i32 2026-02-21T08:56:07.1084529Z %1 = arith.remsi %0, %c32_i32 : i32 2026-02-21T08:56:07.1084650Z %2 = arith.divsi %0, %c32_i32 : i32 2026-02-21T08:56:07.1084763Z %3 = arith.muli %1, %c256_i32 : i32 2026-02-21T08:56:07.1084960Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.1085244Z %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.1085493Z 
%6 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.1085761Z %7 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.1085972Z %8 = arith.addi %6, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:07.1086184Z %9 = arith.addi %7, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:07.1086356Z %10 = arith.muli %2, %c16_i32 : i32 2026-02-21T08:56:07.1086577Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:07.1086852Z %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.1087100Z %13 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:07.1087312Z %14 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.1087528Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:07.1087747Z %16 = arith.addi %14, %12 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:07.1087988Z %17 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.1088267Z %18 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:07.1088573Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T08:56:07.1088830Z %20 = arith.muli %19, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T08:56:07.1089021Z %21 = tt.broadcast %20 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:07.1089239Z %22 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:07.1089512Z %23 = 
tt.expand_dims %9 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T08:56:07.1089798Z %24 = tt.broadcast %23 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T08:56:07.1090052Z %25 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T08:56:07.1090326Z %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:56:07.1090740Z %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:56:07.1091141Z %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.1091393Z %29 = arith.cmpi eq, %28, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.1091594Z %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T08:56:07.1091794Z %31 = arith.cmpi eq, %28, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:07.1091982Z %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T08:56:07.1092252Z %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_2) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T08:56:07.1092529Z %43 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.1092756Z %44 = arith.addi %43, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.1092935Z %45 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:56:07.1093104Z %46 = tt.splat %45 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:07.1093324Z %47 = arith.addi %46, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T08:56:07.1093642Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:07.1093918Z %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:07.1094114Z %50 = arith.addi %21, %49 : tensor<16x4xi32, #blocked2> 2026-02-21T08:56:07.1094315Z %51 = tt.addptr %22, %50 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:56:07.1094527Z %52 = tt.load %51 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:07.1094794Z %53 = ttg.convert_layout %52 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.1095202Z %54 = arith.extf %53 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.1095590Z %55 = tt.expand_dims %44 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:56:07.1095840Z %56 = arith.muli %55, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:07.1096034Z %57 = tt.broadcast %56 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T08:56:07.1096225Z %58 = arith.addi %57, %24 : tensor<2x256xi32, #blocked1> 2026-02-21T08:56:07.1096423Z %59 = tt.addptr %25, %58 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T08:56:07.1096621Z %60 = tt.load %59 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T08:56:07.1096864Z %61 = ttg.convert_layout %60 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1097145Z %62 = arith.shli %61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1097376Z %63 = arith.shrsi %62, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1097616Z %64 = arith.shrsi 
%61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1097904Z %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:56:07.1098469Z %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:56:07.1098755Z %67 = tt.broadcast %65 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1098992Z %68 = arith.select %30, %67, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1099232Z %69 = tt.broadcast %66 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1099464Z %70 = arith.select %32, %69, %68 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1099692Z %71 = tt.reshape %70 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T08:56:07.1099918Z %72 = arith.sitofp %71 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T08:56:07.1100165Z %73 = ttg.local_alloc %72 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T08:56:07.1100491Z %74 = ttg.local_load %73 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.1100968Z %75 = tt.dot %54, %74, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T08:56:07.1101312Z %76 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T08:56:07.1101487Z %77 = tt.splat %76 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.1101705Z %78 = arith.addi %77, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:07.1101918Z %79 = 
arith.muli %76, %c2_i32 : i32 2026-02-21T08:56:07.1102087Z %80 = tt.splat %79 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:07.1102300Z %81 = arith.addi %80, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:07.1102574Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:07.1102845Z %83 = tt.broadcast %82 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:07.1103039Z %84 = arith.addi %21, %83 : tensor<16x4xi32, #blocked2> 2026-02-21T08:56:07.1103238Z %85 = tt.addptr %22, %84 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:56:07.1103438Z %86 = tt.load %85 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:07.1103702Z %87 = ttg.convert_layout %86 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.1104097Z %88 = arith.extf %87 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.1104477Z %89 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:56:07.1104724Z %90 = arith.muli %89, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:07.1104914Z %91 = tt.broadcast %90 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T08:56:07.1105113Z %92 = arith.addi %91, %24 : tensor<2x256xi32, #blocked1> 2026-02-21T08:56:07.1105306Z %93 = tt.addptr %25, %92 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T08:56:07.1105509Z %94 = tt.load %93 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T08:56:07.1105755Z %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1106036Z 
%96 = arith.shli %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1106306Z %97 = arith.shrsi %96, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1106541Z %98 = arith.shrsi %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:07.1106829Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:56:07.1107168Z %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T08:56:07.1107453Z %101 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1107702Z %102 = arith.select %30, %101, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1107949Z %103 = tt.broadcast %100 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1108189Z %104 = arith.select %32, %103, %102 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T08:56:07.1108431Z %105 = tt.reshape %104 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T08:56:07.1108658Z %106 = arith.sitofp %105 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T08:56:07.1108916Z %107 = ttg.local_alloc %106 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T08:56:07.1109239Z %108 = ttg.local_load %107 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:07.1109748Z %109 = tt.dot %88, %108, %75, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T08:56:07.1110098Z scf.yield %109 : tensor<16x256xf32, 
#mma> 2026-02-21T08:56:07.1110228Z } {tt.num_stages = 1 : i32} 2026-02-21T08:56:07.1110388Z %34 = arith.truncf %33 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T08:56:07.1110647Z %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T08:56:07.1110883Z %36 = arith.muli %35, %cst : tensor<16x1xi32, #mma> 2026-02-21T08:56:07.1111116Z %37 = tt.expand_dims %8 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T08:56:07.1111370Z %38 = tt.broadcast %36 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T08:56:07.1111574Z %39 = tt.broadcast %37 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T08:56:07.1111750Z %40 = arith.addi %38, %39 : tensor<16x256xi32, #mma> 2026-02-21T08:56:07.1111931Z %41 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T08:56:07.1112142Z %42 = tt.addptr %41, %40 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T08:56:07.1112340Z tt.store %42, %34 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T08:56:07.1112475Z tt.return 2026-02-21T08:56:07.1112560Z } 2026-02-21T08:56:07.1112640Z } 2026-02-21T08:56:07.1112684Z 2026-02-21T08:56:07.1112717Z {-# 2026-02-21T08:56:07.1112805Z external_resources: { 2026-02-21T08:56:07.1112906Z mlir_reproducer: { 2026-02-21T08:56:07.1113922Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, 
convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:56:07.1114951Z disable_threading: false, 2026-02-21T08:56:07.1115060Z verify_each: true 2026-02-21T08:56:07.1115158Z } 2026-02-21T08:56:07.1115237Z } 2026-02-21T08:56:07.1115310Z #-} 2026-02-21T08:56:07.1115596Z /tmp/torchinductor_root/ws/cwsiwzhboqoddose42mt7vdu26zo3fsbcmzz47doftpr32jj3n5r.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:56:07.1116283Z /tmp/torchinductor_root/ws/cwsiwzhboqoddose42mt7vdu26zo3fsbcmzz47doftpr32jj3n5r.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:56:07.1116842Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:56:07.1117569Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T08:56:07.1118222Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:56:07.1118398Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:56:39.2865476Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:56:39.2870515Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 16], order = [2, 1, 0]}> 2026-02-21T08:56:39.2874649Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}> 2026-02-21T08:56:39.2875583Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T08:56:39.2876374Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 16], order = [1, 0]}> 2026-02-21T08:56:39.2877097Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 16], instrShape = [32, 32], isTransposed = true}> 2026-02-21T08:56:39.2877749Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:56:39.2878220Z #smem = #ttg.shared_memory 2026-02-21T08:56:39.2878805Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:56:39.2880004Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:56:39.2880992Z %cst = arith.constant dense<8192> : tensor<1x4096xi64, #mma> 2026-02-21T08:56:39.2881434Z %cst_0 = arith.constant dense<0> : tensor<1x4096xi64, #mma> 2026-02-21T08:56:39.2881872Z %cst_1 = arith.constant dense<16384> : tensor<16x1xi64, #mma> 2026-02-21T08:56:39.2882287Z %cst_2 = arith.constant dense<0> : tensor<16x1xi64, #mma> 2026-02-21T08:56:39.2882837Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi64, #mma> 2026-02-21T08:56:39.2883150Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:39.2883457Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:39.2883782Z %cst_6 = arith.constant dense<0.000000e+00> 
: tensor<16x4096xf32, #mma> 2026-02-21T08:56:39.2884119Z %cst_7 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2884439Z %cst_8 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2> 2026-02-21T08:56:39.2884823Z %cst_9 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2885477Z %cst_10 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2885818Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:56:39.2886023Z %c6_i32 = arith.constant 6 : i32 2026-02-21T08:56:39.2886234Z %c510_i32 = arith.constant 510 : i32 2026-02-21T08:56:39.2886445Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:56:39.2886659Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:56:39.2886865Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:56:39.2887072Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:56:39.2887270Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:56:39.2887525Z %cst_11 = arith.constant dense<0> : tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2887791Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:56:39.2887994Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:56:39.2888199Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T08:56:39.2888533Z %cst_12 = arith.constant dense<4> : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2888876Z %0 = tt.get_program_id x : i32 2026-02-21T08:56:39.2889216Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:39.2889700Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:39.2890175Z %3 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:39.2890671Z %4 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim 
= 0, parent = #blocked1}>> 2026-02-21T08:56:39.2891243Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2891705Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2892128Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:39.2892484Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x4096x!tt.ptr, #blocked1> 2026-02-21T08:56:39.2892959Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:56:39.2893505Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:56:39.2894022Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:39.2894367Z %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:39.2894630Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x4096xi1, #blocked> 2026-02-21T08:56:39.2894889Z %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:39.2895147Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x4096xi1, #blocked> 2026-02-21T08:56:39.2895422Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x4096x!tt.ptr, #mma> 2026-02-21T08:56:39.2895771Z %17 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:39.2896215Z %18 = arith.extsi %3 : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<4096xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:39.2896578Z %19 = arith.addi %5, %cst_9 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T08:56:39.2896880Z %20 = arith.addi %6, %cst_10 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2897242Z %21 = tt.expand_dims %20 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:39.2897653Z %22 = tt.broadcast %21 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2898010Z %23 = tt.expand_dims %19 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2898326Z %24 = arith.muli %23, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2898579Z %25 = tt.broadcast %24 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2898834Z scf.for %arg3 = %0 to %c2048_i32 step %c38912_i32 : i32 { 2026-02-21T08:56:39.2899031Z %26 = arith.divsi %arg3, %c32_i32 : i32 2026-02-21T08:56:39.2899193Z %27 = arith.muli %26, %c16_i32 : i32 2026-02-21T08:56:39.2899350Z %28 = arith.subi %c1024_i32, %27 : i32 2026-02-21T08:56:39.2899509Z %29 = arith.minsi %28, %c16_i32 : i32 2026-02-21T08:56:39.2899667Z %30 = arith.remsi %arg3, %c32_i32 : i32 2026-02-21T08:56:39.2899831Z %31 = arith.remsi %30, %29 : i32 2026-02-21T08:56:39.2899978Z %32 = arith.addi %27, %31 : i32 2026-02-21T08:56:39.2900130Z %33 = arith.divsi %30, %29 : i32 2026-02-21T08:56:39.2900277Z %34 = arith.muli %32, %c16_i32 : i32 2026-02-21T08:56:39.2900505Z %35 = tt.splat %34 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:39.2900788Z %36 = arith.addi %35, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:39.2901019Z %37 = arith.muli %33, %c4096_i32 : i32 2026-02-21T08:56:39.2901241Z %38 = tt.splat %37 : i32 -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:39.2901538Z %39 = arith.addi %38, %4 : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:39.2901953Z %40 = 
tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T08:56:39.2902283Z %41 = arith.muli %40, %cst_8 : tensor<16x1xi32, #blocked2> 2026-02-21T08:56:39.2902531Z %42 = tt.broadcast %41 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2902934Z %43 = tt.expand_dims %39 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4096xi32, #blocked1> 2026-02-21T08:56:39.2903279Z %44 = tt.broadcast %43 : tensor<1x4096xi32, #blocked1> -> tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2903568Z %45 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_6) -> (tensor<16x4096xf32, #mma>) : i32 { 2026-02-21T08:56:39.2903850Z %92 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2904091Z %93 = arith.addi %92, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2904274Z %94 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:39.2904456Z %95 = tt.splat %94 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2904681Z %96 = arith.addi %95, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2904966Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:39.2905253Z %98 = tt.broadcast %97 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2905451Z %99 = arith.addi %42, %98 : tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2905665Z %100 = tt.addptr %7, %99 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2905885Z %101 = tt.load %100 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:39.2906174Z %102 = ttg.convert_layout %101 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth 
= 2}>> 2026-02-21T08:56:39.2906602Z %103 = arith.extf %102 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2907048Z %104 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2907310Z %105 = arith.muli %104, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2907522Z %106 = tt.broadcast %105 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2907730Z %107 = arith.addi %106, %44 : tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2907947Z %108 = tt.addptr %8, %107 : tensor<2x4096x!tt.ptr, #blocked1>, tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2908161Z %109 = tt.load %108 : tensor<2x4096x!tt.ptr, #blocked1> 2026-02-21T08:56:39.2908428Z %110 = ttg.convert_layout %109 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2908739Z %111 = arith.shli %110, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2908994Z %112 = arith.shrsi %111, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2909255Z %113 = arith.shrsi %110, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2909568Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2909934Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2910246Z %116 = tt.broadcast %114 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2910547Z %117 = arith.select %13, %116, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 
2026-02-21T08:56:39.2910816Z %118 = tt.broadcast %115 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2911072Z %119 = arith.select %15, %118, %117 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2911330Z %120 = tt.reshape %119 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3> 2026-02-21T08:56:39.2911581Z %121 = arith.sitofp %120 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3> 2026-02-21T08:56:39.2911854Z %122 = ttg.local_alloc %121 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem> 2026-02-21T08:56:39.2912211Z %123 = ttg.local_load %122 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2912724Z %124 = tt.dot %103, %123, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma> 2026-02-21T08:56:39.2913106Z %125 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T08:56:39.2913291Z %126 = tt.splat %125 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2913516Z %127 = arith.addi %126, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2913697Z %128 = arith.muli %125, %c2_i32 : i32 2026-02-21T08:56:39.2913866Z %129 = tt.splat %128 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2914091Z %130 = arith.addi %129, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2914374Z %131 = tt.expand_dims %130 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:39.2914835Z %132 = tt.broadcast %131 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2915291Z %133 = arith.addi %42, %132 : tensor<16x4xi32, 
#blocked2> 2026-02-21T08:56:39.2915527Z %134 = tt.addptr %7, %133 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2915769Z %135 = tt.load %134 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:39.2916093Z %136 = ttg.convert_layout %135 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2916524Z %137 = arith.extf %136 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2916956Z %138 = tt.expand_dims %127 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2917467Z %139 = arith.muli %138, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2917693Z %140 = tt.broadcast %139 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2917922Z %141 = arith.addi %140, %44 : tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2918154Z %142 = tt.addptr %8, %141 : tensor<2x4096x!tt.ptr, #blocked1>, tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2918402Z %143 = tt.load %142 : tensor<2x4096x!tt.ptr, #blocked1> 2026-02-21T08:56:39.2918767Z %144 = ttg.convert_layout %143 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2919105Z %145 = arith.shli %144, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2919392Z %146 = arith.shrsi %145, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2919690Z %147 = arith.shrsi %144, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2920036Z %148 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2920421Z %149 = tt.expand_dims %147 {axis = 1 : 
i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2920736Z %150 = tt.broadcast %148 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2921221Z %151 = arith.select %13, %150, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2921491Z %152 = tt.broadcast %149 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2921772Z %153 = arith.select %15, %152, %151 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2922055Z %154 = tt.reshape %153 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3> 2026-02-21T08:56:39.2922308Z %155 = arith.sitofp %154 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3> 2026-02-21T08:56:39.2922645Z %156 = ttg.local_alloc %155 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem> 2026-02-21T08:56:39.2923037Z %157 = ttg.local_load %156 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2923543Z %158 = tt.dot %137, %157, %124, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma> 2026-02-21T08:56:39.2923932Z %159 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T08:56:39.2924124Z %160 = tt.splat %159 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2924395Z %161 = arith.addi %160, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:39.2924605Z %162 = arith.muli %159, %c2_i32 : i32 2026-02-21T08:56:39.2925019Z %163 = tt.splat %162 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:39.2925279Z %164 = arith.addi %163, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T08:56:39.2925581Z %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:39.2925894Z %166 = tt.broadcast %165 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2926124Z %167 = arith.addi %42, %166 : tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2926422Z %168 = tt.addptr %7, %167 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2926658Z %169 = tt.load %168 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:39.2926952Z %170 = ttg.convert_layout %169 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2927400Z %171 = arith.extf %170 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2927830Z %172 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2928125Z %173 = arith.muli %172, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T08:56:39.2928337Z %174 = tt.broadcast %173 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2928766Z %175 = arith.addi %174, %44 : tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2928998Z %176 = tt.addptr %8, %175 : tensor<2x4096x!tt.ptr, #blocked1>, tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2929276Z %177 = tt.load %176 : tensor<2x4096x!tt.ptr, #blocked1> 2026-02-21T08:56:39.2929556Z %178 = ttg.convert_layout %177 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2929875Z %179 = arith.shli %178, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2930150Z %180 = arith.shrsi %179, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T08:56:39.2930434Z %181 = arith.shrsi %178, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2930761Z %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2931140Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2931446Z %184 = tt.broadcast %182 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2931763Z %185 = arith.select %13, %184, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2932232Z %186 = tt.broadcast %183 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2932508Z %187 = arith.select %15, %186, %185 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2932803Z %188 = tt.reshape %187 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3> 2026-02-21T08:56:39.2933050Z %189 = arith.sitofp %188 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3> 2026-02-21T08:56:39.2933345Z %190 = ttg.local_alloc %189 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem> 2026-02-21T08:56:39.2933726Z %191 = ttg.local_load %190 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2934222Z %192 = tt.dot %171, %191, %158, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma> 2026-02-21T08:56:39.2934643Z scf.yield %192 : tensor<16x4096xf32, #mma> 2026-02-21T08:56:39.2934837Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:56:39.2935018Z %46 = arith.addi %42, %22 : tensor<16x4xi32, 
#blocked2> 2026-02-21T08:56:39.2935245Z %47 = tt.addptr %7, %46 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T08:56:39.2935477Z %48 = tt.load %47 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T08:56:39.2935969Z %49 = ttg.convert_layout %48 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2936393Z %50 = arith.extf %49 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2936715Z %51 = arith.addi %25, %44 : tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2936951Z %52 = tt.addptr %8, %51 : tensor<2x4096x!tt.ptr, #blocked1>, tensor<2x4096xi32, #blocked1> 2026-02-21T08:56:39.2937173Z %53 = tt.load %52 : tensor<2x4096x!tt.ptr, #blocked1> 2026-02-21T08:56:39.2937449Z %54 = ttg.convert_layout %53 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2937770Z %55 = arith.shli %54, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2938032Z %56 = arith.shrsi %55, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2938306Z %57 = arith.shrsi %54, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:39.2938668Z %58 = tt.expand_dims %56 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2939027Z %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked> 2026-02-21T08:56:39.2939374Z %60 = tt.broadcast %58 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2939820Z %61 = arith.select %13, %60, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2940096Z %62 = tt.broadcast 
%59 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2940371Z %63 = arith.select %15, %62, %61 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked> 2026-02-21T08:56:39.2940619Z %64 = tt.reshape %63 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3> 2026-02-21T08:56:39.2940879Z %65 = arith.sitofp %64 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3> 2026-02-21T08:56:39.2941146Z %66 = ttg.local_alloc %65 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem> 2026-02-21T08:56:39.2941511Z %67 = ttg.local_load %66 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:39.2942008Z %68 = tt.dot %50, %67, %45, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma> 2026-02-21T08:56:39.2942413Z %69 = arith.truncf %68 : tensor<16x4096xf32, #mma> to tensor<16x4096xbf16, #mma> 2026-02-21T08:56:39.2942625Z %70 = arith.extsi %34 : i32 to i64 2026-02-21T08:56:39.2942774Z %71 = arith.extsi %37 : i32 to i64 2026-02-21T08:56:39.2942953Z %72 = tt.splat %70 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:39.2943383Z %73 = arith.addi %72, %17 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:39.2943657Z %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T08:56:39.2943969Z %75 = arith.muli %74, %cst_3 : tensor<16x1xi64, #mma> 2026-02-21T08:56:39.2944184Z %76 = tt.broadcast %75 : tensor<16x1xi64, #mma> -> tensor<16x4096xi64, #mma> 2026-02-21T08:56:39.2944411Z %77 = tt.splat %71 : i64 -> tensor<4096xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:39.2944650Z %78 = arith.addi %77, %18 : tensor<4096xi64, #ttg.slice<{dim = 0, parent = 
#mma}>> 2026-02-21T08:56:39.2944940Z %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<4096xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x4096xi64, #mma> 2026-02-21T08:56:39.2945239Z %80 = tt.broadcast %79 : tensor<1x4096xi64, #mma> -> tensor<16x4096xi64, #mma> 2026-02-21T08:56:39.2945446Z %81 = arith.addi %76, %80 : tensor<16x4096xi64, #mma> 2026-02-21T08:56:39.2945663Z %82 = tt.addptr %16, %81 : tensor<16x4096x!tt.ptr, #mma>, tensor<16x4096xi64, #mma> 2026-02-21T08:56:39.2945902Z %83 = arith.cmpi sge, %74, %cst_2 : tensor<16x1xi64, #mma> 2026-02-21T08:56:39.2946076Z %84 = arith.cmpi slt, %74, %cst_1 : tensor<16x1xi64, #mma> 2026-02-21T08:56:39.2946543Z %85 = arith.andi %83, %84 : tensor<16x1xi1, #mma> 2026-02-21T08:56:39.2946756Z %86 = tt.broadcast %85 : tensor<16x1xi1, #mma> -> tensor<16x4096xi1, #mma> 2026-02-21T08:56:39.2946951Z %87 = arith.cmpi sge, %79, %cst_0 : tensor<1x4096xi64, #mma> 2026-02-21T08:56:39.2947155Z %88 = arith.cmpi slt, %79, %cst : tensor<1x4096xi64, #mma> 2026-02-21T08:56:39.2947331Z %89 = arith.andi %87, %88 : tensor<1x4096xi1, #mma> 2026-02-21T08:56:39.2947525Z %90 = tt.broadcast %89 : tensor<1x4096xi1, #mma> -> tensor<16x4096xi1, #mma> 2026-02-21T08:56:39.2947730Z %91 = arith.andi %86, %90 : tensor<16x4096xi1, #mma> 2026-02-21T08:56:39.2947961Z tt.store %82, %69, %91 : tensor<16x4096x!tt.ptr, #mma> 2026-02-21T08:56:39.2948165Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32} 2026-02-21T08:56:39.2948348Z tt.return 2026-02-21T08:56:39.2948462Z } 2026-02-21T08:56:39.2948558Z } 2026-02-21T08:56:39.2948802Z 2026-02-21T08:56:39.2948853Z {-# 2026-02-21T08:56:39.2948954Z external_resources: { 2026-02-21T08:56:39.2949084Z mlir_reproducer: { 2026-02-21T08:56:39.2950127Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:56:39.2951152Z disable_threading: false, 2026-02-21T08:56:39.2951278Z verify_each: true 2026-02-21T08:56:39.2951408Z } 2026-02-21T08:56:39.2951503Z } 2026-02-21T08:56:39.2951606Z #-} 2026-02-21T08:56:39.2951914Z /tmp/torchinductor_root/xo/cxobs6s22xqqcebp7prdcno4xjeayhmev525ddwkr7mbhksquilk.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:56:39.2952660Z /tmp/torchinductor_root/xo/cxobs6s22xqqcebp7prdcno4xjeayhmev525ddwkr7mbhksquilk.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:56:39.2953432Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:56:39.2954262Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 4096], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T08:56:39.2955058Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:56:39.2955263Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:56:51.2251313Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:56:51.2260173Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:56:51.2261737Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T08:56:51.2263048Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T08:56:51.2264194Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T08:56:51.2265313Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:56:51.2266175Z #smem = #ttg.shared_memory 2026-02-21T08:56:51.2267216Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:56:51.2278209Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:56:51.2279819Z %cst = arith.constant dense<0.000000e+00> : tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2280105Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:56:51.2280306Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:56:51.2280504Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:56:51.2280776Z %cst_0 = arith.constant dense<0> : tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2281181Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:56:51.2281489Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:56:51.2281758Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:56:51.2281988Z %c2048_i32 = 
arith.constant 2048 : i32 2026-02-21T08:56:51.2282216Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:56:51.2282443Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:56:51.2282922Z %cst_1 = arith.constant dense<4161536> : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2283499Z %cst_2 = arith.constant dense<4128768> : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2283871Z %cst_3 = arith.constant dense<4096000> : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2284131Z %c15_i32 = arith.constant 15 : i32 2026-02-21T08:56:51.2284287Z %c14_i32 = arith.constant 14 : i32 2026-02-21T08:56:51.2284435Z %c13_i32 = arith.constant 13 : i32 2026-02-21T08:56:51.2284791Z %cst_4 = arith.constant dense<22> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2285292Z %cst_5 = arith.constant dense<20> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2285794Z %cst_6 = arith.constant dense<18> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2286281Z %cst_7 = arith.constant dense<16> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2286727Z %cst_8 = arith.constant dense<14> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2287221Z %cst_9 = arith.constant dense<12> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2287713Z %cst_10 = arith.constant dense<10> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2288346Z %cst_11 = arith.constant dense<8> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2288659Z %cst_12 = arith.constant dense<6> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2288974Z %cst_13 = arith.constant dense<4> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2289281Z %cst_14 = 
arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2289524Z %c12_i32 = arith.constant 12 : i32 2026-02-21T08:56:51.2289677Z %c500_i32 = arith.constant 500 : i32 2026-02-21T08:56:51.2289833Z %c11_i32 = arith.constant 11 : i32 2026-02-21T08:56:51.2289981Z %c10_i32 = arith.constant 10 : i32 2026-02-21T08:56:51.2290130Z %c9_i32 = arith.constant 9 : i32 2026-02-21T08:56:51.2290280Z %c7_i32 = arith.constant 7 : i32 2026-02-21T08:56:51.2290418Z %c6_i32 = arith.constant 6 : i32 2026-02-21T08:56:51.2290567Z %c5_i32 = arith.constant 5 : i32 2026-02-21T08:56:51.2290756Z %cst_15 = arith.constant dense<1024> : tensor<2048x1xi32, #blocked1> 2026-02-21T08:56:51.2291045Z %cst_16 = arith.constant dense<4> : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2291321Z %cst_17 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:51.2291557Z %cst_18 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:51.2291781Z %cst_19 = arith.constant dense<8192> : tensor<2048x1xi64, #mma> 2026-02-21T08:56:51.2291996Z %cst_20 = arith.constant dense<0> : tensor<2048x1xi64, #mma> 2026-02-21T08:56:51.2292239Z %cst_21 = arith.constant dense<16384> : tensor<2048x1xi64, #mma> 2026-02-21T08:56:51.2292414Z %cst_22 = arith.constant dense<0> : tensor<1x16xi64, #mma> 2026-02-21T08:56:51.2292640Z %cst_23 = arith.constant dense<8192> : tensor<1x16xi64, #mma> 2026-02-21T08:56:51.2292793Z %0 = tt.get_program_id x : i32 2026-02-21T08:56:51.2292920Z %1 = arith.divsi %0, %c2048_i32 : i32 2026-02-21T08:56:51.2293048Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T08:56:51.2293162Z %3 = arith.subi %c8_i32, %2 : i32 2026-02-21T08:56:51.2293280Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T08:56:51.2293396Z %5 = arith.remsi %0, %c2048_i32 : i32 2026-02-21T08:56:51.2293519Z %6 = arith.remsi %5, %4 : i32 2026-02-21T08:56:51.2293632Z %7 = arith.addi %2, %6 : i32 2026-02-21T08:56:51.2293745Z %8 = arith.divsi 
%5, %4 : i32 2026-02-21T08:56:51.2293856Z %9 = arith.muli %7, %c2048_i32 : i32 2026-02-21T08:56:51.2294075Z %10 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:51.2294371Z %11 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:51.2294640Z %12 = tt.splat %9 : i32 -> tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:51.2294881Z %13 = arith.addi %12, %10 : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:51.2295064Z %14 = arith.muli %8, %c16_i32 : i32 2026-02-21T08:56:51.2295310Z %15 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:51.2295632Z %16 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:51.2295920Z %17 = tt.splat %14 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:51.2296223Z %18 = arith.addi %17, %15 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T08:56:51.2296514Z %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2296837Z %20 = tt.expand_dims %13 {axis = 1 : i32} : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2048x1xi32, #blocked1> 2026-02-21T08:56:51.2297138Z %21 = arith.muli %20, %cst_15 : tensor<2048x1xi32, #blocked1> 2026-02-21T08:56:51.2297345Z %22 = tt.broadcast %21 : tensor<2048x1xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2297578Z %23 = tt.splat %arg0 : !tt.ptr -> tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2297936Z %24 = tt.expand_dims %18 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = 
#ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2298324Z %25 = tt.splat %arg1 : !tt.ptr -> tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2298642Z %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:56:51.2299063Z %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:56:51.2299478Z %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:51.2299740Z %29 = arith.cmpi eq, %28, %cst_17 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:51.2299944Z %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, #blocked> 2026-02-21T08:56:51.2300152Z %31 = arith.cmpi eq, %28, %cst_18 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:56:51.2300348Z %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, #blocked> 2026-02-21T08:56:51.2300576Z %33 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2300833Z %34 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2301096Z %35 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2301310Z %36 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2301580Z %37 = tt.expand_dims %19 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2301859Z %38 = tt.broadcast %37 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2302056Z %39 = arith.addi %22, %38 : tensor<2048x2xi32, 
#blocked1> 2026-02-21T08:56:51.2302258Z %40 = tt.addptr %23, %39 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2302471Z %41 = tt.load %40 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2302669Z %42 = arith.addi %19, %cst_14 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2302953Z %43 = tt.expand_dims %42 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2303229Z %44 = tt.broadcast %43 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2303420Z %45 = arith.addi %22, %44 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2303626Z %46 = tt.addptr %23, %45 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2303831Z %47 = tt.load %46 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2304025Z %48 = arith.addi %19, %cst_13 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2304297Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2304567Z %50 = tt.broadcast %49 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2304763Z %51 = arith.addi %22, %50 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2304961Z %52 = tt.addptr %23, %51 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2305206Z %53 = tt.load %52 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2305395Z %54 = arith.addi %19, %cst_12 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2305671Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2305943Z %56 = tt.broadcast %55 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 
2026-02-21T08:56:51.2306131Z %57 = arith.addi %22, %56 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2306329Z %58 = tt.addptr %23, %57 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2306528Z %59 = tt.load %58 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2306825Z %60 = ttg.memdesc_index %33[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2307201Z ttg.local_store %41, %60 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2307566Z %61 = ttg.memdesc_index %34[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2307931Z ttg.local_store %47, %61 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2308297Z %62 = ttg.memdesc_index %35[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2308700Z ttg.local_store %53, %62 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2309063Z %63 = ttg.memdesc_index %36[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2309423Z ttg.local_store %59, %63 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2309706Z %64 = arith.addi %19, %cst_11 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2309987Z %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2310260Z %66 = tt.broadcast %65 : tensor<1x2xi32, #blocked1> -> 
tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2310456Z %67 = arith.addi %22, %66 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2310655Z %68 = tt.addptr %23, %67 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2310864Z %69 = tt.load %68 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2311058Z %70 = arith.addi %19, %cst_10 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2311331Z %71 = tt.expand_dims %70 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2311605Z %72 = tt.broadcast %71 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2311795Z %73 = arith.addi %22, %72 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2311996Z %74 = tt.addptr %23, %73 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2312202Z %75 = tt.load %74 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2312392Z %76 = arith.addi %19, %cst_9 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2312671Z %77 = tt.expand_dims %76 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2312940Z %78 = tt.broadcast %77 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2333763Z %79 = arith.addi %22, %78 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2333966Z %80 = tt.addptr %23, %79 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2334172Z %81 = tt.load %80 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2334367Z %82 = arith.addi %19, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2334645Z %83 = tt.expand_dims %82 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2334922Z %84 = 
tt.broadcast %83 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2335117Z %85 = arith.addi %22, %84 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2335318Z %86 = tt.addptr %23, %85 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2335526Z %87 = tt.load %86 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2335816Z %88 = ttg.memdesc_index %33[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2336190Z ttg.local_store %69, %88 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2336559Z %89 = ttg.memdesc_index %34[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2336921Z ttg.local_store %75, %89 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2337336Z %90 = ttg.memdesc_index %35[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2337696Z ttg.local_store %81, %90 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2338064Z %91 = ttg.memdesc_index %36[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2338434Z ttg.local_store %87, %91 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2338713Z %92 = arith.addi %19, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2338996Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 
2026-02-21T08:56:51.2339276Z %94 = tt.broadcast %93 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2339469Z %95 = arith.addi %22, %94 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2339677Z %96 = tt.addptr %23, %95 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2339879Z %97 = tt.load %96 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2340073Z %98 = arith.addi %19, %cst_6 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2340344Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2340620Z %100 = tt.broadcast %99 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2340823Z %101 = arith.addi %22, %100 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2341031Z %102 = tt.addptr %23, %101 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2341249Z %103 = tt.load %102 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2341453Z %104 = arith.addi %19, %cst_5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2341737Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2342060Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2342260Z %107 = arith.addi %22, %106 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2342472Z %108 = tt.addptr %23, %107 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2342684Z %109 = tt.load %108 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2342886Z %110 = arith.addi %19, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2343169Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2343451Z %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2343651Z %113 = arith.addi %22, %112 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2343860Z %114 = tt.addptr %23, %113 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2344075Z %115 = tt.load %114 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2344366Z %116 = ttg.memdesc_index %33[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2344734Z ttg.local_store %97, %116 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2345107Z %117 = ttg.memdesc_index %34[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2345512Z ttg.local_store %103, %117 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2345885Z %118 = ttg.memdesc_index %35[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2346258Z ttg.local_store %109, %118 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2346623Z %119 = ttg.memdesc_index %36[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2346998Z ttg.local_store %115, %119 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2348710Z %120:23 = scf.for %arg3 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg4 = %cst, %arg5 = %c2_i32, %arg6 = %60, %arg7 = %88, 
%arg8 = %116, %arg9 = %61, %arg10 = %89, %arg11 = %117, %arg12 = %c1_i32, %arg13 = %c5_i32, %arg14 = %c9_i32, %arg15 = %62, %arg16 = %90, %arg17 = %118, %arg18 = %c2_i32, %arg19 = %c6_i32, %arg20 = %c10_i32, %arg21 = %63, %arg22 = %91, %arg23 = %119, %arg24 = %c3_i32, %arg25 = %c7_i32, %arg26 = %c11_i32) -> (tensor<2048x16xf32, #mma>, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32) : i32 { 2026-02-21T08:56:51.2350407Z %381 = arith.addi %arg3, %c12_i32 : i32 2026-02-21T08:56:51.2350539Z %382 = arith.muli %381, %c2_i32 : i32 2026-02-21T08:56:51.2350750Z %383 = tt.splat %382 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2350979Z %384 = arith.addi %383, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2351259Z %385 = tt.expand_dims %384 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2351550Z %386 = tt.broadcast %385 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2351761Z %387 = arith.addi %22, %386 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2351973Z %388 = tt.addptr %23, %387 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 
2026-02-21T08:56:51.2352192Z %389 = tt.load %388 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2352503Z %390 = ttg.local_load %arg6 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2352962Z %391 = arith.extf %390 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2353260Z %392 = arith.muli %arg3, %c8192_i32 : i32 2026-02-21T08:56:51.2353442Z %393 = tt.splat %392 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2353679Z %394 = arith.addi %393, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2353992Z %395 = tt.addptr %25, %394 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2354309Z %396 = tt.load %395 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2354586Z %397 = arith.shli %396, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2354833Z %398 = arith.shrsi %397, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2355077Z %399 = arith.shrsi %396, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2355371Z %400 = tt.expand_dims %398 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2355712Z %401 = tt.expand_dims %399 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2356001Z %402 = tt.broadcast %400 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2356244Z %403 = arith.select %30, %402, %cst_0 : tensor<1x2x16xi1, #blocked>, 
tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2356491Z %404 = tt.broadcast %401 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2356730Z %405 = arith.select %32, %404, %403 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2356964Z %406 = tt.reshape %405 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2357194Z %407 = arith.sitofp %406 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2357493Z %408 = ttg.convert_layout %407 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2357980Z %409 = tt.dot %391, %408, %arg4, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2358346Z %410 = arith.addi %arg3, %c13_i32 : i32 2026-02-21T08:56:51.2358472Z %411 = arith.muli %410, %c2_i32 : i32 2026-02-21T08:56:51.2358651Z %412 = tt.splat %411 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2358879Z %413 = arith.addi %412, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2359199Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2359486Z %415 = tt.broadcast %414 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2359687Z %416 = arith.addi %22, %415 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2359901Z %417 = tt.addptr %23, %416 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2360116Z %418 = tt.load %417 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2360430Z %419 = ttg.local_load %arg9 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> 
tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2360882Z %420 = arith.extf %419 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2361181Z %421 = arith.muli %arg12, %c8192_i32 : i32 2026-02-21T08:56:51.2361365Z %422 = tt.splat %421 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2361593Z %423 = arith.addi %422, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2361907Z %424 = tt.addptr %25, %423 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2362222Z %425 = tt.load %424 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2362456Z %426 = arith.shli %425, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2362828Z %427 = arith.shrsi %426, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2363066Z %428 = arith.shrsi %425, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2363360Z %429 = tt.expand_dims %427 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2363696Z %430 = tt.expand_dims %428 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2363973Z %431 = tt.broadcast %429 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2364211Z %432 = arith.select %30, %431, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2364445Z %433 = tt.broadcast %430 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2364681Z %434 = arith.select %32, %433, %432 : 
tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2364909Z %435 = tt.reshape %434 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2365132Z %436 = arith.sitofp %435 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2365430Z %437 = ttg.convert_layout %436 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2365895Z %438 = tt.dot %420, %437, %409, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2366248Z %439 = arith.addi %arg3, %c14_i32 : i32 2026-02-21T08:56:51.2366371Z %440 = arith.muli %439, %c2_i32 : i32 2026-02-21T08:56:51.2366539Z %441 = tt.splat %440 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2366769Z %442 = arith.addi %441, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2367044Z %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2367382Z %444 = tt.broadcast %443 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2367581Z %445 = arith.addi %22, %444 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2367783Z %446 = tt.addptr %23, %445 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2367995Z %447 = tt.load %446 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2368302Z %448 = ttg.local_load %arg15 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2368749Z %449 = arith.extf %448 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2369037Z %450 = arith.muli %arg18, %c8192_i32 : i32 2026-02-21T08:56:51.2369215Z %451 = tt.splat %450 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2369444Z %452 = arith.addi %451, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2369755Z %453 = tt.addptr %25, %452 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2370059Z %454 = tt.load %453 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2370291Z %455 = arith.shli %454, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2370528Z %456 = arith.shrsi %455, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2370806Z %457 = arith.shrsi %454, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2371093Z %458 = tt.expand_dims %456 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2371426Z %459 = tt.expand_dims %457 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2371707Z %460 = tt.broadcast %458 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2371943Z %461 = arith.select %30, %460, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2372183Z %462 = tt.broadcast %459 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2372418Z %463 = arith.select %32, %462, %461 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2372651Z %464 = tt.reshape %463 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2372873Z %465 = arith.sitofp %464 : 
tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2373167Z %466 = ttg.convert_layout %465 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2373638Z %467 = tt.dot %449, %466, %438, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2373986Z %468 = arith.addi %arg3, %c15_i32 : i32 2026-02-21T08:56:51.2374107Z %469 = arith.muli %468, %c2_i32 : i32 2026-02-21T08:56:51.2374277Z %470 = tt.splat %469 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2374500Z %471 = arith.addi %470, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:56:51.2374782Z %472 = tt.expand_dims %471 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T08:56:51.2375096Z %473 = tt.broadcast %472 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2375292Z %474 = arith.addi %22, %473 : tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2375501Z %475 = tt.addptr %23, %474 : tensor<2048x2x!tt.ptr, #blocked1>, tensor<2048x2xi32, #blocked1> 2026-02-21T08:56:51.2375713Z %476 = tt.load %475 : tensor<2048x2x!tt.ptr, #blocked1> 2026-02-21T08:56:51.2376024Z %477 = ttg.local_load %arg21 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2376475Z %478 = arith.extf %477 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2376765Z %479 = arith.muli %arg24, %c8192_i32 : i32 2026-02-21T08:56:51.2376946Z %480 = tt.splat %479 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T08:56:51.2377175Z %481 = arith.addi %480, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2377481Z %482 = tt.addptr %25, %481 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2377790Z %483 = tt.load %482 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2378020Z %484 = arith.shli %483, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2378259Z %485 = arith.shrsi %484, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2378494Z %486 = arith.shrsi %483, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2378814Z %487 = tt.expand_dims %485 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2379151Z %488 = tt.expand_dims %486 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2379430Z %489 = tt.broadcast %487 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2379668Z %490 = arith.select %30, %489, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2379904Z %491 = tt.broadcast %488 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2380131Z %492 = arith.select %32, %491, %490 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2380362Z %493 = tt.reshape %492 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2380584Z %494 = arith.sitofp %493 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2380880Z %495 = ttg.convert_layout %494 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T08:56:51.2381347Z %496 = tt.dot %478, %495, %467, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2381692Z %497 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T08:56:51.2381821Z %498 = arith.cmpi slt, %497, %c3_i32 : i32 2026-02-21T08:56:51.2381949Z %499 = arith.select %498, %497, %c0_i32 : i32 2026-02-21T08:56:51.2382229Z %500 = ttg.memdesc_index %33[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2382602Z ttg.local_store %389, %500 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2382970Z %501 = ttg.memdesc_index %34[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2383376Z ttg.local_store %418, %501 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2383743Z %502 = ttg.memdesc_index %35[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2384108Z ttg.local_store %447, %502 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2384474Z %503 = ttg.memdesc_index %36[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2384839Z ttg.local_store %476, %503 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> 2026-02-21T08:56:51.2386301Z scf.yield %496, %499, %arg7, %arg8, %500, %arg10, %arg11, %501, %arg13, %arg14, %410, %arg16, %arg17, %502, %arg19, %arg20, %439, %arg22, %arg23, %503, 
%arg25, %arg26, %468 : tensor<2048x16xf32, #mma>, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32 2026-02-21T08:56:51.2387585Z } 2026-02-21T08:56:51.2387832Z %121 = ttg.local_load %120#2 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2388274Z %122 = arith.extf %121 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2388614Z %123 = arith.addi %24, %cst_3 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2388928Z %124 = tt.addptr %25, %123 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2389233Z %125 = tt.load %124 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2389464Z %126 = arith.shli %125, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2389699Z %127 = arith.shrsi %126, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2389939Z %128 = arith.shrsi %125, %cst_16 : 
tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2390226Z %129 = tt.expand_dims %127 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2390553Z %130 = tt.expand_dims %128 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2390833Z %131 = tt.broadcast %129 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2391071Z %132 = arith.select %30, %131, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2391303Z %133 = tt.broadcast %130 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2391539Z %134 = arith.select %32, %133, %132 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2391763Z %135 = tt.reshape %134 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2392024Z %136 = arith.sitofp %135 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2392319Z %137 = ttg.convert_layout %136 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2392787Z %138 = tt.dot %122, %137, %120#0, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2393300Z %139 = ttg.local_load %120#5 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2393747Z %140 = arith.extf %139 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2394033Z %141 = arith.muli %120#8, %c8192_i32 : i32 
2026-02-21T08:56:51.2394209Z %142 = tt.splat %141 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2394433Z %143 = arith.addi %142, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2394741Z %144 = tt.addptr %25, %143 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2395049Z %145 = tt.load %144 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2395279Z %146 = arith.shli %145, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2395515Z %147 = arith.shrsi %146, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2395786Z %148 = arith.shrsi %145, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2396079Z %149 = tt.expand_dims %147 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2396410Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2396685Z %151 = tt.broadcast %149 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2396922Z %152 = arith.select %30, %151, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2397155Z %153 = tt.broadcast %150 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2397386Z %154 = arith.select %32, %153, %152 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2397617Z %155 = tt.reshape %154 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2397835Z %156 = arith.sitofp %155 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2398129Z %157 = ttg.convert_layout %156 : 
tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2398589Z %158 = tt.dot %140, %157, %138, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2399089Z %159 = ttg.local_load %120#11 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2399530Z %160 = arith.extf %159 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2399818Z %161 = arith.muli %120#14, %c8192_i32 : i32 2026-02-21T08:56:51.2399994Z %162 = tt.splat %161 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2400252Z %163 = arith.addi %162, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2400554Z %164 = tt.addptr %25, %163 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2400859Z %165 = tt.load %164 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2401088Z %166 = arith.shli %165, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2401326Z %167 = arith.shrsi %166, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2401560Z %168 = arith.shrsi %165, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2401846Z %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2402178Z %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim 
= 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2402459Z %171 = tt.broadcast %169 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2402808Z %172 = arith.select %30, %171, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2403046Z %173 = tt.broadcast %170 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2403275Z %174 = arith.select %32, %173, %172 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2403518Z %175 = tt.reshape %174 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2403733Z %176 = arith.sitofp %175 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2404065Z %177 = ttg.convert_layout %176 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2404528Z %178 = tt.dot %160, %177, %158, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2405027Z %179 = ttg.local_load %120#17 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2405464Z %180 = arith.extf %179 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2405748Z %181 = arith.muli %120#20, %c8192_i32 : i32 2026-02-21T08:56:51.2405919Z %182 = tt.splat %181 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2406142Z %183 = arith.addi %182, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2406444Z %184 = tt.addptr %25, %183 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent 
= #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2406748Z %185 = tt.load %184 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2406980Z %186 = arith.shli %185, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2407212Z %187 = arith.shrsi %186, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2407445Z %188 = arith.shrsi %185, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2407726Z %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2408061Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2408378Z %191 = tt.broadcast %189 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2408611Z %192 = arith.select %30, %191, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2408848Z %193 = tt.broadcast %190 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2409083Z %194 = arith.select %32, %193, %192 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2409322Z %195 = tt.reshape %194 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2409546Z %196 = arith.sitofp %195 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2409840Z %197 = ttg.convert_layout %196 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2410314Z %198 = tt.dot %180, %197, %178, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2410823Z %199 = ttg.local_load %120#3 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2411262Z %200 = arith.extf %199 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2411602Z %201 = arith.addi %24, %cst_2 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2411915Z %202 = tt.addptr %25, %201 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2412277Z %203 = tt.load %202 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2412514Z %204 = arith.shli %203, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2412752Z %205 = arith.shrsi %204, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2412994Z %206 = arith.shrsi %203, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2413284Z %207 = tt.expand_dims %205 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2413622Z %208 = tt.expand_dims %206 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2413912Z %209 = tt.broadcast %207 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2414149Z %210 = arith.select %30, %209, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2414391Z %211 = tt.broadcast %208 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2414624Z %212 = arith.select %32, %211, %210 : tensor<1x2x16xi1, #blocked>, 
tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2414856Z %213 = tt.reshape %212 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2415080Z %214 = arith.sitofp %213 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2415371Z %215 = ttg.convert_layout %214 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2415837Z %216 = tt.dot %200, %215, %198, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2416351Z %217 = ttg.local_load %120#6 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2416799Z %218 = arith.extf %217 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2417122Z %219 = arith.muli %120#9, %c8192_i32 : i32 2026-02-21T08:56:51.2417297Z %220 = tt.splat %219 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2417527Z %221 = arith.addi %220, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2417841Z %222 = tt.addptr %25, %221 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2418150Z %223 = tt.load %222 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2418387Z %224 = arith.shli %223, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2418626Z %225 = arith.shrsi %224, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2418871Z %226 = arith.shrsi %223, %cst_16 : tensor<1x16xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2419162Z %227 = tt.expand_dims %225 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2419495Z %228 = tt.expand_dims %226 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2419781Z %229 = tt.broadcast %227 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2420020Z %230 = arith.select %30, %229, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2420265Z %231 = tt.broadcast %228 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2420548Z %232 = arith.select %32, %231, %230 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2420776Z %233 = tt.reshape %232 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2421002Z %234 = arith.sitofp %233 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2421294Z %235 = ttg.convert_layout %234 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2421761Z %236 = tt.dot %218, %235, %216, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2422265Z %237 = ttg.local_load %120#12 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2422710Z %238 = arith.extf %237 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2423003Z %239 = arith.muli %120#15, %c8192_i32 : i32 2026-02-21T08:56:51.2423184Z 
%240 = tt.splat %239 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2423407Z %241 = arith.addi %240, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2423720Z %242 = tt.addptr %25, %241 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2424030Z %243 = tt.load %242 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2424267Z %244 = arith.shli %243, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2424508Z %245 = arith.shrsi %244, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2424751Z %246 = arith.shrsi %243, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2425074Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2425411Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2425692Z %249 = tt.broadcast %247 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2425932Z %250 = arith.select %30, %249, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2426167Z %251 = tt.broadcast %248 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2426402Z %252 = arith.select %32, %251, %250 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2426635Z %253 = tt.reshape %252 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2426855Z %254 = arith.sitofp %253 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2427155Z %255 = ttg.convert_layout %254 : tensor<2x16xf32, #blocked2> -> 
tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2427623Z %256 = tt.dot %238, %255, %236, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2428130Z %257 = ttg.local_load %120#18 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2428577Z %258 = arith.extf %257 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2428899Z %259 = arith.muli %120#21, %c8192_i32 : i32 2026-02-21T08:56:51.2429080Z %260 = tt.splat %259 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2429310Z %261 = arith.addi %260, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2429621Z %262 = tt.addptr %25, %261 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2429931Z %263 = tt.load %262 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2430163Z %264 = arith.shli %263, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2430403Z %265 = arith.shrsi %264, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2430639Z %266 = arith.shrsi %263, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2430931Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2431275Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2431556Z %269 = tt.broadcast %267 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2431796Z %270 = arith.select %30, %269, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2432037Z %271 = tt.broadcast %268 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2432266Z %272 = arith.select %32, %271, %270 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2432500Z %273 = tt.reshape %272 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2432719Z %274 = arith.sitofp %273 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2433026Z %275 = ttg.convert_layout %274 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2433530Z %276 = tt.dot %258, %275, %256, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2434030Z %277 = ttg.local_load %120#4 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2434473Z %278 = arith.extf %277 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2434810Z %279 = arith.addi %24, %cst_1 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2435127Z %280 = tt.addptr %25, %279 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2435440Z %281 = tt.load %280 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2435672Z 
%282 = arith.shli %281, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2435912Z %283 = arith.shrsi %282, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2436151Z %284 = arith.shrsi %281, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2436440Z %285 = tt.expand_dims %283 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2436772Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2437084Z %287 = tt.broadcast %285 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2437325Z %288 = arith.select %30, %287, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2437566Z %289 = tt.broadcast %286 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2437796Z %290 = arith.select %32, %289, %288 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2438031Z %291 = tt.reshape %290 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2438251Z %292 = arith.sitofp %291 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2438555Z %293 = ttg.convert_layout %292 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2439024Z %294 = tt.dot %278, %293, %276, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2439527Z %295 = ttg.local_load %120#7 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T08:56:51.2439973Z %296 = arith.extf %295 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2440261Z %297 = arith.muli %120#10, %c8192_i32 : i32 2026-02-21T08:56:51.2440437Z %298 = tt.splat %297 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2440666Z %299 = arith.addi %298, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2440970Z %300 = tt.addptr %25, %299 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2441286Z %301 = tt.load %300 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2441522Z %302 = arith.shli %301, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2441792Z %303 = arith.shrsi %302, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2442033Z %304 = arith.shrsi %301, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2442322Z %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2442878Z %306 = tt.expand_dims %304 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2443218Z %307 = tt.broadcast %305 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2443464Z %308 = arith.select %30, %307, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2443705Z %309 = tt.broadcast %306 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2443940Z %310 = arith.select %32, %309, %308 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 
2026-02-21T08:56:51.2444171Z %311 = tt.reshape %310 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2444395Z %312 = arith.sitofp %311 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2444686Z %313 = ttg.convert_layout %312 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2445153Z %314 = tt.dot %296, %313, %294, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2445772Z %315 = ttg.local_load %120#13 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2446222Z %316 = arith.extf %315 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2446515Z %317 = arith.muli %120#16, %c8192_i32 : i32 2026-02-21T08:56:51.2446691Z %318 = tt.splat %317 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2446924Z %319 = arith.addi %318, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2447233Z %320 = tt.addptr %25, %319 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2447540Z %321 = tt.load %320 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2447780Z %322 = arith.shli %321, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2448015Z %323 = arith.shrsi %322, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2448257Z %324 = arith.shrsi %321, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T08:56:51.2448547Z %325 = tt.expand_dims %323 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2448876Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2449158Z %327 = tt.broadcast %325 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2449393Z %328 = arith.select %30, %327, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2449630Z %329 = tt.broadcast %326 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2449866Z %330 = arith.select %32, %329, %328 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2450139Z %331 = tt.reshape %330 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2450360Z %332 = arith.sitofp %331 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2450650Z %333 = ttg.convert_layout %332 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2451114Z %334 = tt.dot %316, %333, %314, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2451615Z %335 = ttg.local_load %120#19 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2452053Z %336 = arith.extf %335 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2452344Z %337 = arith.muli %120#22, %c8192_i32 : i32 2026-02-21T08:56:51.2452523Z %338 = tt.splat %337 
: i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2452750Z %339 = arith.addi %338, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2453059Z %340 = tt.addptr %25, %339 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2453366Z %341 = tt.load %340 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2453604Z %342 = arith.shli %341, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2453845Z %343 = arith.shrsi %342, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2454289Z %344 = arith.shrsi %341, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:51.2454584Z %345 = tt.expand_dims %343 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2454912Z %346 = tt.expand_dims %344 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T08:56:51.2455194Z %347 = tt.broadcast %345 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2455433Z %348 = arith.select %30, %347, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2455670Z %349 = tt.broadcast %346 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2455905Z %350 = arith.select %32, %349, %348 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T08:56:51.2456135Z %351 = tt.reshape %350 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T08:56:51.2456359Z %352 = arith.sitofp %351 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T08:56:51.2456658Z %353 = ttg.convert_layout %352 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:51.2457119Z %354 = tt.dot %336, %353, %334, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma> 2026-02-21T08:56:51.2457511Z ttg.local_dealloc %36 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2457725Z ttg.local_dealloc %35 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2457930Z ttg.local_dealloc %34 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2458139Z ttg.local_dealloc %33 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> 2026-02-21T08:56:51.2458352Z %355 = arith.truncf %354 : tensor<2048x16xf32, #mma> to tensor<2048x16xbf16, #mma> 2026-02-21T08:56:51.2458564Z %356 = arith.extsi %9 : i32 to i64 2026-02-21T08:56:51.2458684Z %357 = arith.extsi %14 : i32 to i64 2026-02-21T08:56:51.2458848Z %358 = tt.splat %arg2 : !tt.ptr -> tensor<2048x16x!tt.ptr, #mma> 2026-02-21T08:56:51.2459058Z %359 = tt.splat %356 : i64 -> tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:51.2459341Z %360 = arith.extsi %11 : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:51.2459631Z %361 = arith.addi %359, %360 : tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:51.2459908Z %362 = tt.expand_dims %361 {axis = 1 : i32} : tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<2048x1xi64, #mma> 2026-02-21T08:56:51.2460159Z %363 = arith.muli %362, %cst_19 : tensor<2048x1xi64, #mma> 2026-02-21T08:56:51.2460353Z %364 = tt.broadcast %363 : tensor<2048x1xi64, #mma> -> tensor<2048x16xi64, #mma> 2026-02-21T08:56:51.2460573Z %365 = tt.splat %357 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:51.2460849Z %366 = arith.extsi %16 : 
tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:51.2461127Z %367 = arith.addi %365, %366 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:51.2461395Z %368 = tt.expand_dims %367 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma> 2026-02-21T08:56:51.2461661Z %369 = tt.broadcast %368 : tensor<1x16xi64, #mma> -> tensor<2048x16xi64, #mma> 2026-02-21T08:56:51.2461845Z %370 = arith.addi %364, %369 : tensor<2048x16xi64, #mma> 2026-02-21T08:56:51.2462044Z %371 = tt.addptr %358, %370 : tensor<2048x16x!tt.ptr, #mma>, tensor<2048x16xi64, #mma> 2026-02-21T08:56:51.2462284Z %372 = arith.cmpi sge, %362, %cst_20 : tensor<2048x1xi64, #mma> 2026-02-21T08:56:51.2462461Z %373 = arith.cmpi slt, %362, %cst_21 : tensor<2048x1xi64, #mma> 2026-02-21T08:56:51.2462628Z %374 = arith.andi %372, %373 : tensor<2048x1xi1, #mma> 2026-02-21T08:56:51.2462814Z %375 = tt.broadcast %374 : tensor<2048x1xi1, #mma> -> tensor<2048x16xi1, #mma> 2026-02-21T08:56:51.2463008Z %376 = arith.cmpi sge, %368, %cst_22 : tensor<1x16xi64, #mma> 2026-02-21T08:56:51.2463175Z %377 = arith.cmpi slt, %368, %cst_23 : tensor<1x16xi64, #mma> 2026-02-21T08:56:51.2463337Z %378 = arith.andi %376, %377 : tensor<1x16xi1, #mma> 2026-02-21T08:56:51.2463505Z %379 = tt.broadcast %378 : tensor<1x16xi1, #mma> -> tensor<2048x16xi1, #mma> 2026-02-21T08:56:51.2463686Z %380 = arith.andi %375, %379 : tensor<2048x16xi1, #mma> 2026-02-21T08:56:51.2463851Z tt.store %371, %355, %380 : tensor<2048x16x!tt.ptr, #mma> 2026-02-21T08:56:51.2463994Z tt.return 2026-02-21T08:56:51.2464090Z } 2026-02-21T08:56:51.2464174Z } 2026-02-21T08:56:51.2464219Z 2026-02-21T08:56:51.2464259Z {-# 2026-02-21T08:56:51.2464345Z external_resources: { 2026-02-21T08:56:51.2464454Z mlir_reproducer: { 2026-02-21T08:56:51.2465455Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, 
convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:56:51.2466459Z disable_threading: false, 2026-02-21T08:56:51.2466569Z verify_each: true 2026-02-21T08:56:51.2466668Z } 2026-02-21T08:56:51.2466747Z } 2026-02-21T08:56:51.2466825Z #-} 2026-02-21T08:56:51.2467105Z /tmp/torchinductor_root/w6/cw6ainxzjafdthoodfldr5o2eaj3ll2vel4ift4ys25lrduk4l3i.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:56:51.2467858Z /tmp/torchinductor_root/w6/cw6ainxzjafdthoodfldr5o2eaj3ll2vel4ift4ys25lrduk4l3i.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:56:51.2468421Z [90s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:56:51.2469152Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T08:56:51.2469814Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:56:51.2469988Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:56:53.2573820Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T08:56:53.2580571Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}> 2026-02-21T08:56:53.2581104Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}> 2026-02-21T08:56:53.2581770Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T08:56:53.2582788Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T08:56:53.2583275Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:56:53.2583717Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:56:53.2584048Z #smem = #ttg.shared_memory 2026-02-21T08:56:53.2584479Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:56:53.2585365Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:56:53.2586222Z %cst = arith.constant dense<16384> : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2586532Z %cst_0 = arith.constant dense<0> : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2586840Z %cst_1 = arith.constant dense<8192> : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2587153Z %cst_2 = arith.constant dense<8192> : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2587470Z %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2587780Z %cst_4 = arith.constant 
dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2588096Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2588421Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2588740Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:56:53.2589063Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:56:53.2589387Z %cst_9 = arith.constant dense<1024> : tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2589668Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:56:53.2589957Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2590266Z %c16991_i32 = arith.constant 16991 : i32 2026-02-21T08:56:53.2590434Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:56:53.2590733Z %c-1_i32 = arith.constant -1 : i32 2026-02-21T08:56:53.2590910Z %cst_11 = arith.constant dense<0> : tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2591120Z %cst_12 = arith.constant dense<0> : tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2591357Z %cst_13 = arith.constant dense : tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2591662Z %c255_i32 = arith.constant 255 : i32 2026-02-21T08:56:53.2591857Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:56:53.2592042Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:56:53.2592227Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:56:53.2592416Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:56:53.2592598Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T08:56:53.2592789Z %c1216_i32 = arith.constant 1216 : i32 2026-02-21T08:56:53.2592971Z %c1824_i32 = arith.constant 1824 : i32 2026-02-21T08:56:53.2593199Z %cst_14 = arith.constant dense<0> : tensor<2x256xi8, #blocked> 2026-02-21T08:56:53.2593478Z %cst_15 = arith.constant dense<8192> : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2593756Z %cst_16 = arith.constant dense<0> : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2594040Z %cst_17 = 
arith.constant dense<0> : tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2594277Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:56:53.2594462Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:56:53.2594622Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:56:53.2594743Z %c608_i32 = arith.constant 608 : i32 2026-02-21T08:56:53.2594984Z %cst_18 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2595288Z %0 = tt.get_program_id x : i32 2026-02-21T08:56:53.2595569Z %1 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2596044Z %2 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2596326Z %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2596578Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<32x4x!tt.ptr, #blocked2> 2026-02-21T08:56:53.2596787Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked> 2026-02-21T08:56:53.2597032Z %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2597353Z %7 = arith.extsi %6 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2597684Z %8 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2598017Z %9 = arith.extsi %8 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2598387Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T08:56:53.2598870Z %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, 
#ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T08:56:53.2599281Z %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T08:56:53.2599546Z %13 = arith.cmpi eq, %12, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:56:53.2599762Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 2026-02-21T08:56:53.2599970Z %15 = arith.cmpi eq, %12, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T08:56:53.2600177Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 2026-02-21T08:56:53.2600393Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<32x256x!tt.ptr, #mma> 2026-02-21T08:56:53.2600722Z %18 = arith.extsi %2 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2601042Z %19 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2601350Z %20 = arith.extsi %19 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2601601Z %21 = arith.subi %c16991_i32, %0 : i32 2026-02-21T08:56:53.2601742Z %22 = arith.divui %21, %c608_i32 : i32 2026-02-21T08:56:53.2601865Z %23 = arith.remsi %22, %c4_i32 : i32 2026-02-21T08:56:53.2601987Z %24 = arith.subi %22, %23 : i32 2026-02-21T08:56:53.2602105Z %25 = arith.muli %24, %c608_i32 : i32 2026-02-21T08:56:53.2602230Z %26 = arith.addi %0, %25 : i32 2026-02-21T08:56:53.2602363Z scf.for %arg3 = %0 to %26 step %c2432_i32 : i32 { 2026-02-21T08:56:53.2602513Z %32 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T08:56:53.2602725Z %33 = arith.muli %32, %c32_i32 : i32 2026-02-21T08:56:53.2602850Z %34 = arith.subi %c512_i32, %33 : i32 2026-02-21T08:56:53.2602969Z %35 = arith.minsi %34, 
%c32_i32 : i32 2026-02-21T08:56:53.2603099Z %36 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T08:56:53.2603227Z %37 = arith.remsi %36, %35 : i32 2026-02-21T08:56:53.2603345Z %38 = arith.addi %33, %37 : i32 2026-02-21T08:56:53.2603466Z %39 = arith.divsi %36, %35 : i32 2026-02-21T08:56:53.2603580Z %40 = arith.muli %38, %c32_i32 : i32 2026-02-21T08:56:53.2603759Z %41 = tt.splat %40 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2604035Z %42 = arith.addi %41, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2604213Z %43 = arith.muli %39, %c256_i32 : i32 2026-02-21T08:56:53.2604441Z %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2604694Z %45 = arith.muli %44, %cst_9 : tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2604893Z %46 = tt.broadcast %45 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2605068Z %47 = arith.extsi %43 : i32 to i64 2026-02-21T08:56:53.2605240Z %48 = tt.splat %47 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2605457Z %49 = arith.addi %48, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2605734Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2606019Z %51 = tt.broadcast %50 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2606219Z %52 = arith.cmpi sge, %50, %cst_16 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2606399Z %53 = arith.cmpi slt, %50, %cst_15 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2606563Z %54 = arith.andi %52, %53 : tensor<1x256xi1, #blocked> 2026-02-21T08:56:53.2606752Z %55 = tt.broadcast %54 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2607028Z %56 = scf.for %arg4 = 
%c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>) : i32 { 2026-02-21T08:56:53.2607250Z %223 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:53.2607434Z %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2607663Z %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2607949Z %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:53.2608232Z %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2608475Z %228 = arith.addi %46, %227 : tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2608682Z %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr, #blocked2>, tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2608890Z %230 = tt.load %229 : tensor<32x4x!tt.ptr, #blocked2> 2026-02-21T08:56:53.2609118Z %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem> 2026-02-21T08:56:53.2609455Z %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2609868Z %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2610156Z %234 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:53.2610337Z %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2610564Z %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2610841Z %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2611088Z %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, 
#blocked> 2026-02-21T08:56:53.2611288Z %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2611481Z %240 = arith.addi %239, %51 : tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2611686Z %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2611939Z %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2612111Z %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2612283Z %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked> 2026-02-21T08:56:53.2612471Z %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2612665Z %246 = arith.andi %245, %55 : tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2612836Z %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T08:56:53.2613107Z %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2613407Z %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2613657Z %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2613916Z %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2614222Z %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2614577Z %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2614880Z %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2615134Z %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, 
#blocked1> 2026-02-21T08:56:53.2615391Z %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2615643Z %257 = arith.select %16, %256, %255 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2615887Z %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked> 2026-02-21T08:56:53.2616117Z %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked> 2026-02-21T08:56:53.2616418Z %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T08:56:53.2616754Z %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2617249Z %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2617608Z scf.yield %262 : tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2617782Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32} 2026-02-21T08:56:53.2617991Z %57 = arith.truncf %56 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma> 2026-02-21T08:56:53.2618166Z %58 = arith.extsi %40 : i32 to i64 2026-02-21T08:56:53.2618337Z %59 = tt.splat %58 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2618545Z %60 = arith.addi %59, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2618807Z %61 = tt.expand_dims %60 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2619042Z %62 = arith.muli %61, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2619222Z %63 = tt.broadcast %62 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2619431Z %64 = tt.splat %47 
: i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2619638Z %65 = arith.addi %64, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2619943Z %66 = tt.expand_dims %65 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2620414Z %67 = tt.broadcast %66 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2620613Z %68 = arith.addi %63, %67 : tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2620851Z %69 = tt.addptr %17, %68 : tensor<32x256x!tt.ptr, #mma>, tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2628256Z %70 = arith.cmpi sge, %61, %cst_0 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2628450Z %71 = arith.cmpi slt, %61, %cst : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2628609Z %72 = arith.andi %70, %71 : tensor<32x1xi1, #mma> 2026-02-21T08:56:53.2628785Z %73 = tt.broadcast %72 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2628973Z %74 = arith.cmpi sge, %66, %cst_3 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2629144Z %75 = arith.cmpi slt, %66, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2629310Z %76 = arith.andi %74, %75 : tensor<1x256xi1, #mma> 2026-02-21T08:56:53.2629488Z %77 = tt.broadcast %76 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2629668Z %78 = arith.andi %73, %77 : tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2629826Z tt.store %69, %57, %78 : tensor<32x256x!tt.ptr, #mma> 2026-02-21T08:56:53.2629981Z %79 = arith.addi %arg3, %c608_i32 : i32 2026-02-21T08:56:53.2630108Z %80 = arith.divsi %79, %c1024_i32 : i32 2026-02-21T08:56:53.2630237Z %81 = arith.muli %80, %c32_i32 : i32 2026-02-21T08:56:53.2630361Z %82 = arith.subi %c512_i32, %81 : i32 2026-02-21T08:56:53.2630488Z %83 = arith.minsi %82, %c32_i32 : i32 2026-02-21T08:56:53.2630612Z %84 = arith.remsi %79, %c1024_i32 : i32 2026-02-21T08:56:53.2630735Z %85 = arith.remsi %84, %83 : i32 
2026-02-21T08:56:53.2630857Z %86 = arith.addi %81, %85 : i32 2026-02-21T08:56:53.2630972Z %87 = arith.divsi %84, %83 : i32 2026-02-21T08:56:53.2631100Z %88 = arith.muli %86, %c32_i32 : i32 2026-02-21T08:56:53.2631274Z %89 = tt.splat %88 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2631566Z %90 = arith.addi %89, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2631742Z %91 = arith.muli %87, %c256_i32 : i32 2026-02-21T08:56:53.2631974Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2632232Z %93 = arith.muli %92, %cst_9 : tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2632427Z %94 = tt.broadcast %93 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2632606Z %95 = arith.extsi %91 : i32 to i64 2026-02-21T08:56:53.2632777Z %96 = tt.splat %95 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2633002Z %97 = arith.addi %96, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2633289Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2633573Z %99 = tt.broadcast %98 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2633783Z %100 = arith.cmpi sge, %98, %cst_16 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2633964Z %101 = arith.cmpi slt, %98, %cst_15 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2634141Z %102 = arith.andi %100, %101 : tensor<1x256xi1, #blocked> 2026-02-21T08:56:53.2634339Z %103 = tt.broadcast %102 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2634610Z %104 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>) : i32 { 2026-02-21T08:56:53.2634835Z %223 = 
arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:53.2635062Z %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2635297Z %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2635586Z %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:53.2635868Z %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2636072Z %228 = arith.addi %94, %227 : tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2636278Z %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr, #blocked2>, tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2636492Z %230 = tt.load %229 : tensor<32x4x!tt.ptr, #blocked2> 2026-02-21T08:56:53.2636718Z %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem> 2026-02-21T08:56:53.2637063Z %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2637481Z %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2637771Z %234 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:53.2637953Z %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2638183Z %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2638462Z %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2638718Z %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2638917Z %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 
2026-02-21T08:56:53.2639122Z %240 = arith.addi %239, %99 : tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2639323Z %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2639575Z %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2639755Z %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2639922Z %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked> 2026-02-21T08:56:53.2640118Z %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2640313Z %246 = arith.andi %245, %103 : tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2640492Z %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T08:56:53.2640765Z %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2641059Z %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2641314Z %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2641569Z %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2641878Z %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2642235Z %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2642535Z %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2642864Z %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2643160Z %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 
2026-02-21T08:56:53.2643414Z %257 = arith.select %16, %256, %255 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2643662Z %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked> 2026-02-21T08:56:53.2643891Z %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked> 2026-02-21T08:56:53.2644150Z %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T08:56:53.2644483Z %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2644974Z %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2645347Z scf.yield %262 : tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2645519Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32} 2026-02-21T08:56:53.2645737Z %105 = arith.truncf %104 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma> 2026-02-21T08:56:53.2645918Z %106 = arith.extsi %88 : i32 to i64 2026-02-21T08:56:53.2646088Z %107 = tt.splat %106 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2646308Z %108 = arith.addi %107, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2646581Z %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2646827Z %110 = arith.muli %109, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2647013Z %111 = tt.broadcast %110 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2647231Z %112 = tt.splat %95 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2647451Z %113 = arith.addi %112, 
%20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2647764Z %114 = tt.expand_dims %113 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2648035Z %115 = tt.broadcast %114 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2648221Z %116 = arith.addi %111, %115 : tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2648418Z %117 = tt.addptr %17, %116 : tensor<32x256x!tt.ptr, #mma>, tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2648630Z %118 = arith.cmpi sge, %109, %cst_0 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2648797Z %119 = arith.cmpi slt, %109, %cst : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2648963Z %120 = arith.andi %118, %119 : tensor<32x1xi1, #mma> 2026-02-21T08:56:53.2649141Z %121 = tt.broadcast %120 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2649336Z %122 = arith.cmpi sge, %114, %cst_3 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2649508Z %123 = arith.cmpi slt, %114, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2649674Z %124 = arith.andi %122, %123 : tensor<1x256xi1, #mma> 2026-02-21T08:56:53.2649856Z %125 = tt.broadcast %124 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2650039Z %126 = arith.andi %121, %125 : tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2650210Z tt.store %117, %105, %126 : tensor<32x256x!tt.ptr, #mma> 2026-02-21T08:56:53.2650367Z %127 = arith.addi %arg3, %c1216_i32 : i32 2026-02-21T08:56:53.2650502Z %128 = arith.divsi %127, %c1024_i32 : i32 2026-02-21T08:56:53.2650629Z %129 = arith.muli %128, %c32_i32 : i32 2026-02-21T08:56:53.2650757Z %130 = arith.subi %c512_i32, %129 : i32 2026-02-21T08:56:53.2650885Z %131 = arith.minsi %130, %c32_i32 : i32 2026-02-21T08:56:53.2651049Z %132 = arith.remsi %127, %c1024_i32 : i32 2026-02-21T08:56:53.2651177Z %133 = arith.remsi %132, %131 : i32 2026-02-21T08:56:53.2651301Z %134 = arith.addi %129, %133 : i32 
2026-02-21T08:56:53.2651424Z %135 = arith.divsi %132, %131 : i32 2026-02-21T08:56:53.2651545Z %136 = arith.muli %134, %c32_i32 : i32 2026-02-21T08:56:53.2651727Z %137 = tt.splat %136 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2651964Z %138 = arith.addi %137, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2652144Z %139 = arith.muli %135, %c256_i32 : i32 2026-02-21T08:56:53.2652381Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2652640Z %141 = arith.muli %140, %cst_9 : tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2652850Z %142 = tt.broadcast %141 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2653032Z %143 = arith.extsi %139 : i32 to i64 2026-02-21T08:56:53.2653209Z %144 = tt.splat %143 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2653440Z %145 = arith.addi %144, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2653723Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2654013Z %147 = tt.broadcast %146 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2654222Z %148 = arith.cmpi sge, %146, %cst_16 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2654407Z %149 = arith.cmpi slt, %146, %cst_15 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2654583Z %150 = arith.andi %148, %149 : tensor<1x256xi1, #blocked> 2026-02-21T08:56:53.2654783Z %151 = tt.broadcast %150 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2655061Z %152 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>) : i32 { 2026-02-21T08:56:53.2679480Z %223 = arith.muli %arg4, %c2_i32 : i32 
2026-02-21T08:56:53.2679665Z %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2679901Z %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2680187Z %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:53.2680475Z %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2680675Z %228 = arith.addi %142, %227 : tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2680886Z %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr, #blocked2>, tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2681104Z %230 = tt.load %229 : tensor<32x4x!tt.ptr, #blocked2> 2026-02-21T08:56:53.2681331Z %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem> 2026-02-21T08:56:53.2681674Z %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2682087Z %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2682380Z %234 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:53.2682610Z %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2682842Z %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2683177Z %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2683428Z %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2683631Z %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2683835Z %240 = 
arith.addi %239, %147 : tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2684042Z %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2684265Z %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2684447Z %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2684624Z %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked> 2026-02-21T08:56:53.2684822Z %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2685028Z %246 = arith.andi %245, %151 : tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2685211Z %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T08:56:53.2685482Z %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2685781Z %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2686031Z %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2686285Z %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2686597Z %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2686952Z %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2687264Z %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2687574Z %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2687830Z %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2688086Z %257 
= arith.select %16, %256, %255 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2688330Z %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked> 2026-02-21T08:56:53.2688563Z %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked> 2026-02-21T08:56:53.2688829Z %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T08:56:53.2689169Z %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2689657Z %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2690021Z scf.yield %262 : tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2690195Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32} 2026-02-21T08:56:53.2690415Z %153 = arith.truncf %152 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma> 2026-02-21T08:56:53.2690592Z %154 = arith.extsi %136 : i32 to i64 2026-02-21T08:56:53.2690765Z %155 = tt.splat %154 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2690982Z %156 = arith.addi %155, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2691295Z %157 = tt.expand_dims %156 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2691540Z %158 = arith.muli %157, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2691727Z %159 = tt.broadcast %158 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2691943Z %160 = tt.splat %143 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2692158Z %161 = arith.addi %160, %20 : tensor<256xi64, 
#ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2692435Z %162 = tt.expand_dims %161 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2692710Z %163 = tt.broadcast %162 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2692898Z %164 = arith.addi %159, %163 : tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2693097Z %165 = tt.addptr %17, %164 : tensor<32x256x!tt.ptr, #mma>, tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2693305Z %166 = arith.cmpi sge, %157, %cst_0 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2693477Z %167 = arith.cmpi slt, %157, %cst : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2693650Z %168 = arith.andi %166, %167 : tensor<32x1xi1, #mma> 2026-02-21T08:56:53.2693830Z %169 = tt.broadcast %168 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2694021Z %170 = arith.cmpi sge, %162, %cst_3 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2694194Z %171 = arith.cmpi slt, %162, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2694354Z %172 = arith.andi %170, %171 : tensor<1x256xi1, #mma> 2026-02-21T08:56:53.2694537Z %173 = tt.broadcast %172 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2694757Z %174 = arith.andi %169, %173 : tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2694931Z tt.store %165, %153, %174 : tensor<32x256x!tt.ptr, #mma> 2026-02-21T08:56:53.2695093Z %175 = arith.addi %arg3, %c1824_i32 : i32 2026-02-21T08:56:53.2695230Z %176 = arith.divsi %175, %c1024_i32 : i32 2026-02-21T08:56:53.2695357Z %177 = arith.muli %176, %c32_i32 : i32 2026-02-21T08:56:53.2695525Z %178 = arith.subi %c512_i32, %177 : i32 2026-02-21T08:56:53.2695653Z %179 = arith.minsi %178, %c32_i32 : i32 2026-02-21T08:56:53.2695780Z %180 = arith.remsi %175, %c1024_i32 : i32 2026-02-21T08:56:53.2695910Z %181 = arith.remsi %180, %179 : i32 2026-02-21T08:56:53.2696030Z %182 = arith.addi %177, %181 : i32 2026-02-21T08:56:53.2696154Z %183 = 
arith.divsi %180, %179 : i32 2026-02-21T08:56:53.2696276Z %184 = arith.muli %182, %c32_i32 : i32 2026-02-21T08:56:53.2696457Z %185 = tt.splat %184 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2696694Z %186 = arith.addi %185, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2696875Z %187 = arith.muli %183, %c256_i32 : i32 2026-02-21T08:56:53.2697117Z %188 = tt.expand_dims %186 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2697378Z %189 = arith.muli %188, %cst_9 : tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2697592Z %190 = tt.broadcast %189 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2697777Z %191 = arith.extsi %187 : i32 to i64 2026-02-21T08:56:53.2697951Z %192 = tt.splat %191 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2698182Z %193 = arith.addi %192, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2698464Z %194 = tt.expand_dims %193 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2698752Z %195 = tt.broadcast %194 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2699010Z %196 = arith.cmpi sge, %194, %cst_16 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2699196Z %197 = arith.cmpi slt, %194, %cst_15 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2699374Z %198 = arith.andi %196, %197 : tensor<1x256xi1, #blocked> 2026-02-21T08:56:53.2699565Z %199 = tt.broadcast %198 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2699839Z %200 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>) : i32 { 2026-02-21T08:56:53.2700059Z %223 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:56:53.2700239Z %224 = 
tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2700470Z %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2700750Z %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:53.2701041Z %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2701240Z %228 = arith.addi %190, %227 : tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2701450Z %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr, #blocked2>, tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2701666Z %230 = tt.load %229 : tensor<32x4x!tt.ptr, #blocked2> 2026-02-21T08:56:53.2701896Z %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem> 2026-02-21T08:56:53.2702235Z %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2702649Z %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2702945Z %234 = arith.extsi %arg4 : i32 to i64 2026-02-21T08:56:53.2703123Z %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2703385Z %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2703663Z %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2703912Z %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2704115Z %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2704312Z %240 = arith.addi %239, %195 : tensor<2x256xi64, 
#blocked> 2026-02-21T08:56:53.2704512Z %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2704725Z %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2704901Z %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2705072Z %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked> 2026-02-21T08:56:53.2705268Z %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2705463Z %246 = arith.andi %245, %199 : tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2705641Z %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T08:56:53.2705910Z %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2706206Z %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2706454Z %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2706704Z %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2707052Z %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2707408Z %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2707710Z %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2707966Z %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2708220Z %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2708473Z %257 = arith.select %16, %256, %255 : 
tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2708715Z %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked> 2026-02-21T08:56:53.2708948Z %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked> 2026-02-21T08:56:53.2709211Z %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T08:56:53.2709542Z %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2710032Z %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2710390Z scf.yield %262 : tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2710563Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32} 2026-02-21T08:56:53.2710780Z %201 = arith.truncf %200 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma> 2026-02-21T08:56:53.2710958Z %202 = arith.extsi %184 : i32 to i64 2026-02-21T08:56:53.2711129Z %203 = tt.splat %202 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2711375Z %204 = arith.addi %203, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2711646Z %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2711889Z %206 = arith.muli %205, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2712073Z %207 = tt.broadcast %206 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2712288Z %208 = tt.splat %191 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2712503Z %209 = arith.addi %208, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T08:56:53.2712780Z %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2713047Z %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2713236Z %212 = arith.addi %207, %211 : tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2713437Z %213 = tt.addptr %17, %212 : tensor<32x256x!tt.ptr, #mma>, tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2713638Z %214 = arith.cmpi sge, %205, %cst_0 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2713810Z %215 = arith.cmpi slt, %205, %cst : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2713970Z %216 = arith.andi %214, %215 : tensor<32x1xi1, #mma> 2026-02-21T08:56:53.2714152Z %217 = tt.broadcast %216 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2714346Z %218 = arith.cmpi sge, %210, %cst_3 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2714517Z %219 = arith.cmpi slt, %210, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2714683Z %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma> 2026-02-21T08:56:53.2714903Z %221 = tt.broadcast %220 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2715091Z %222 = arith.andi %217, %221 : tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2715257Z tt.store %213, %201, %222 : tensor<32x256x!tt.ptr, #mma> 2026-02-21T08:56:53.2715414Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:56:53.2715544Z %27 = arith.subi %c16384_i32, %26 : i32 2026-02-21T08:56:53.2715675Z %28 = arith.ceildivsi %27, %c608_i32 : i32 2026-02-21T08:56:53.2715803Z %29 = arith.muli %28, %c256_i32 : i32 2026-02-21T08:56:53.2715922Z %30 = arith.subi %26, %c608_i32 : i32 2026-02-21T08:56:53.2716429Z %31:9 = scf.for %arg3 = %c0_i32 to %29 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %30, %arg6 = %c0_i32, %arg7 = %cst_10, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_11, %arg11 = %cst_12, %arg12 = %cst_13) -> (i32, i32, i32, tensor<32x256xf32, 
#mma>, i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>) : i32 { 2026-02-21T08:56:53.2716934Z %32 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:56:53.2717071Z %33 = arith.cmpi eq, %arg4, %c255_i32 : i32 2026-02-21T08:56:53.2717211Z %34 = arith.select %33, %c0_i32, %32 : i32 2026-02-21T08:56:53.2717341Z %35 = arith.cmpi eq, %34, %c0_i32 : i32 2026-02-21T08:56:53.2717474Z %36 = arith.select %35, %c0_i32, %arg6 : i32 2026-02-21T08:56:53.2717707Z %37:6 = scf.if %35 -> (i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>, i32) { 2026-02-21T08:56:53.2717943Z %81 = arith.addi %arg5, %c608_i32 : i32 2026-02-21T08:56:53.2718075Z %82 = arith.divsi %81, %c1024_i32 : i32 2026-02-21T08:56:53.2718200Z %83 = arith.muli %82, %c32_i32 : i32 2026-02-21T08:56:53.2718326Z %84 = arith.subi %c512_i32, %83 : i32 2026-02-21T08:56:53.2718448Z %85 = arith.minsi %84, %c32_i32 : i32 2026-02-21T08:56:53.2718576Z %86 = arith.remsi %81, %c1024_i32 : i32 2026-02-21T08:56:53.2718700Z %87 = arith.remsi %86, %85 : i32 2026-02-21T08:56:53.2718823Z %88 = arith.addi %83, %87 : i32 2026-02-21T08:56:53.2718988Z %89 = arith.divsi %86, %85 : i32 2026-02-21T08:56:53.2719107Z %90 = arith.muli %88, %c32_i32 : i32 2026-02-21T08:56:53.2719283Z %91 = tt.splat %90 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2719511Z %92 = arith.addi %91, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:56:53.2719690Z %93 = arith.muli %89, %c256_i32 : i32 2026-02-21T08:56:53.2719918Z %94 = tt.expand_dims %92 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2720175Z %95 = arith.muli %94, %cst_9 : tensor<32x1xi32, #blocked2> 2026-02-21T08:56:53.2720373Z %96 = tt.broadcast %95 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2720552Z %97 = arith.extsi %93 : i32 to 
i64 2026-02-21T08:56:53.2720724Z %98 = tt.splat %97 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2720947Z %99 = arith.addi %98, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T08:56:53.2721254Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2721543Z %101 = tt.broadcast %100 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2721751Z %102 = arith.cmpi sge, %100, %cst_16 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2721935Z %103 = arith.cmpi slt, %100, %cst_15 : tensor<1x256xi64, #blocked> 2026-02-21T08:56:53.2722114Z %104 = arith.andi %102, %103 : tensor<1x256xi1, #blocked> 2026-02-21T08:56:53.2722306Z %105 = tt.broadcast %104 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2722705Z scf.yield %90, %93, %96, %101, %105, %81 : i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>, i32 2026-02-21T08:56:53.2722949Z } else { 2026-02-21T08:56:53.2723190Z scf.yield %arg8, %arg9, %arg10, %arg11, %arg12, %arg5 : i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>, i32 2026-02-21T08:56:53.2723452Z } 2026-02-21T08:56:53.2723540Z %38 = arith.muli %36, %c2_i32 : i32 2026-02-21T08:56:53.2723716Z %39 = tt.splat %38 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2723935Z %40 = arith.addi %39, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:56:53.2724212Z %41 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T08:56:53.2724493Z %42 = tt.broadcast %41 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2724698Z %43 = arith.addi %37#2, %42 : tensor<32x4xi32, #blocked2> 
2026-02-21T08:56:53.2724904Z %44 = tt.addptr %4, %43 : tensor<32x4x!tt.ptr, #blocked2>, tensor<32x4xi32, #blocked2> 2026-02-21T08:56:53.2725105Z %45 = tt.load %44 : tensor<32x4x!tt.ptr, #blocked2> 2026-02-21T08:56:53.2725325Z %46 = ttg.local_alloc %45 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem> 2026-02-21T08:56:53.2725651Z %47 = ttg.local_load %46 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2726056Z %48 = arith.extf %47 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2726336Z %49 = arith.extsi %36 : i32 to i64 2026-02-21T08:56:53.2726505Z %50 = tt.splat %49 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2726728Z %51 = arith.addi %50, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:56:53.2727044Z %52 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2727288Z %53 = arith.muli %52, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2727480Z %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2727671Z %55 = arith.addi %54, %37#3 : tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2727870Z %56 = tt.addptr %5, %55 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T08:56:53.2728075Z %57 = arith.cmpi sge, %52, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2728247Z %58 = arith.cmpi slt, %52, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T08:56:53.2728412Z %59 = arith.andi %57, %58 : tensor<2x1xi1, #blocked> 2026-02-21T08:56:53.2728594Z %60 = tt.broadcast %59 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2728785Z %61 = arith.andi %60, %37#4 : tensor<2x256xi1, #blocked> 
2026-02-21T08:56:53.2728955Z %62 = tt.load %56, %61, %cst_14 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T08:56:53.2729219Z %63 = ttg.convert_layout %62 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2729510Z %64 = arith.shli %63, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2729750Z %65 = arith.shrsi %64, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2729996Z %66 = arith.shrsi %63, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:56:53.2730293Z %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2730675Z %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T08:56:53.2730969Z %69 = tt.broadcast %67 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2731215Z %70 = arith.select %14, %69, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2731461Z %71 = tt.broadcast %68 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2731700Z %72 = arith.select %16, %71, %70 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T08:56:53.2731932Z %73 = tt.reshape %72 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked> 2026-02-21T08:56:53.2732152Z %74 = arith.sitofp %73 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked> 2026-02-21T08:56:53.2732394Z %75 = ttg.local_alloc %74 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T08:56:53.2732726Z %76 = ttg.local_load %75 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:56:53.2733201Z %77 = tt.dot 
%48, %76, %arg7, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2733550Z %78 = arith.addi %36, %c2_i32 : i32 2026-02-21T08:56:53.2733683Z %79 = arith.cmpi eq, %34, %c255_i32 : i32 2026-02-21T08:56:53.2733835Z %80 = arith.select %79, %cst_10, %77 : tensor<32x256xf32, #mma> 2026-02-21T08:56:53.2733980Z scf.if %79 { 2026-02-21T08:56:53.2734119Z %81 = arith.truncf %77 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma> 2026-02-21T08:56:53.2734292Z %82 = arith.extsi %37#0 : i32 to i64 2026-02-21T08:56:53.2734415Z %83 = arith.extsi %37#1 : i32 to i64 2026-02-21T08:56:53.2734586Z %84 = tt.splat %82 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2734806Z %85 = arith.addi %84, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:56:53.2735108Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2735352Z %87 = arith.muli %86, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2735535Z %88 = tt.broadcast %87 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2735749Z %89 = tt.splat %83 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2735967Z %90 = arith.addi %89, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:56:53.2736235Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2736504Z %92 = tt.broadcast %91 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2736687Z %93 = arith.addi %88, %92 : tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2736882Z %94 = tt.addptr %17, %93 : tensor<32x256x!tt.ptr, #mma>, tensor<32x256xi64, #mma> 2026-02-21T08:56:53.2737084Z %95 = 
arith.cmpi sge, %86, %cst_0 : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2737249Z %96 = arith.cmpi slt, %86, %cst : tensor<32x1xi64, #mma> 2026-02-21T08:56:53.2737409Z %97 = arith.andi %95, %96 : tensor<32x1xi1, #mma> 2026-02-21T08:56:53.2737582Z %98 = tt.broadcast %97 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2737771Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2737940Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x256xi64, #mma> 2026-02-21T08:56:53.2738108Z %101 = arith.andi %99, %100 : tensor<1x256xi1, #mma> 2026-02-21T08:56:53.2738325Z %102 = tt.broadcast %101 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2738510Z %103 = arith.andi %98, %102 : tensor<32x256xi1, #mma> 2026-02-21T08:56:53.2738681Z tt.store %94, %81, %103 : tensor<32x256x!tt.ptr, #mma> 2026-02-21T08:56:53.2738820Z } 2026-02-21T08:56:53.2739091Z scf.yield %34, %37#5, %78, %80, %37#0, %37#1, %37#2, %37#3, %37#4 : i32, i32, i32, tensor<32x256xf32, #mma>, i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked> 2026-02-21T08:56:53.2739411Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32} 2026-02-21T08:56:53.2739553Z tt.return 2026-02-21T08:56:53.2739642Z } 2026-02-21T08:56:53.2739728Z } 2026-02-21T08:56:53.2739775Z 2026-02-21T08:56:53.2739814Z {-# 2026-02-21T08:56:53.2739901Z external_resources: { 2026-02-21T08:56:53.2740011Z mlir_reproducer: { 2026-02-21T08:56:53.2741014Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:56:53.2742018Z disable_threading: false, 2026-02-21T08:56:53.2742130Z verify_each: true 2026-02-21T08:56:53.2742229Z } 2026-02-21T08:56:53.2742309Z } 2026-02-21T08:56:53.2742386Z #-} 2026-02-21T08:56:53.2742668Z /tmp/torchinductor_root/gz/cgzzn756yxc7hskf4mdl2sgzbtg7provhizvgiphj4oto5qophj2.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:56:53.2743387Z /tmp/torchinductor_root/gz/cgzzn756yxc7hskf4mdl2sgzbtg7provhizvgiphj4oto5qophj2.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:56:53.2743981Z [92s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:56:53.2744765Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 32, 256], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T08:56:53.2745486Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:56:53.2745660Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:57:10.0402125Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T08:57:10.0406848Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T08:57:10.0407738Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:57:10.0408624Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:57:10.0409445Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T08:57:10.0410220Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T08:57:10.0411382Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T08:57:10.0411852Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T08:57:10.0412220Z #smem = #ttg.shared_memory 2026-02-21T08:57:10.0412687Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T08:57:10.0413643Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:57:10.0414478Z %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T08:57:10.0414835Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:57:10.0415180Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T08:57:10.0415546Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0415866Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:57:10.0416107Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:57:10.0416341Z %c31_i32 = arith.constant 31 : i32 2026-02-21T08:57:10.0416703Z 
%cst_3 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0417206Z %cst_4 = arith.constant dense<16> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0417694Z %cst_5 = arith.constant dense<24> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0418118Z %cst_6 = arith.constant dense<8192> : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0418472Z %cst_7 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 2026-02-21T08:57:10.0418772Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:57:10.0419014Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:57:10.0419242Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:57:10.0419474Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:57:10.0419707Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:57:10.0419946Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:57:10.0420355Z %c12_i32 = arith.constant 12 : i32 2026-02-21T08:57:10.0420641Z %cst_8 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0420868Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:57:10.0421037Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:57:10.0421198Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:57:10.0421364Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T08:57:10.0421580Z %cst_9 = arith.constant dense<4> : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0421884Z %cst_10 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0422157Z %0 = tt.get_program_id x : i32 2026-02-21T08:57:10.0422431Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0422838Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:57:10.0423247Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:57:10.0423627Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:57:10.0423999Z %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0424372Z %6 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0424726Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0425015Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0425454Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T08:57:10.0426053Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T08:57:10.0426625Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T08:57:10.0426985Z %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:57:10.0427266Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T08:57:10.0427544Z %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T08:57:10.0427813Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T08:57:10.0428111Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<64x128x!tt.ptr, #mma> 2026-02-21T08:57:10.0428347Z %17 = arith.subi %c16384_i32, %0 : i32 2026-02-21T08:57:10.0428529Z %18 = arith.ceildivsi %17, %c2432_i32 : i32 2026-02-21T08:57:10.0428709Z %19 = arith.muli %18, %c32_i32 : i32 2026-02-21T08:57:10.0428946Z %20 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, 
#shared, #smem, mutable> 2026-02-21T08:57:10.0429238Z %21 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> 2026-02-21T08:57:10.0429529Z %22 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> 2026-02-21T08:57:10.0429817Z %23 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> 2026-02-21T08:57:10.0430057Z %24 = arith.cmpi sgt, %19, %c0_i32 : i32 2026-02-21T08:57:10.0430231Z %25 = arith.divsi %0, %c2048_i32 : i32 2026-02-21T08:57:10.0430396Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T08:57:10.0430567Z %27 = arith.subi %c64_i32, %26 : i32 2026-02-21T08:57:10.0430711Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T08:57:10.0430846Z %29 = arith.remsi %0, %c2048_i32 : i32 2026-02-21T08:57:10.0430972Z %30 = arith.remsi %29, %28 : i32 2026-02-21T08:57:10.0431148Z %31 = arith.addi %26, %30 : i32 2026-02-21T08:57:10.0431275Z %32 = arith.divsi %29, %28 : i32 2026-02-21T08:57:10.0431401Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T08:57:10.0431583Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0431825Z %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:57:10.0432063Z %36 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0432296Z %37 = arith.addi %35, %2 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:57:10.0432491Z %38 = arith.muli %32, %c64_i32 : i32 2026-02-21T08:57:10.0432679Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:57:10.0432918Z %40 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:57:10.0433227Z %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T08:57:10.0433506Z %42 = arith.muli %41, %cst_7 : tensor<64x1xi32, 
#blocked1> 2026-02-21T08:57:10.0433719Z %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0434037Z %44 = tt.expand_dims %37 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi32, #blocked2> 2026-02-21T08:57:10.0434347Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0434654Z %46 = tt.expand_dims %6 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0434956Z %47 = tt.broadcast %46 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0435211Z %48 = arith.addi %43, %47 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0435433Z %49 = tt.addptr %7, %48 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0435652Z %50 = tt.splat %24 : i1 -> tensor<64x8xi1, #blocked1> 2026-02-21T08:57:10.0435941Z %51 = tt.load %49, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0436210Z %52 = arith.addi %6, %cst_3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0436512Z %53 = tt.expand_dims %52 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0436816Z %54 = tt.broadcast %53 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0437029Z %55 = arith.addi %43, %54 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0437250Z %56 = tt.addptr %7, %55 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0437521Z %57 = tt.load %56, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0437783Z %58 = arith.addi %6, %cst_4 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0438089Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<8xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0438386Z %60 = tt.broadcast %59 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0438587Z %61 = arith.addi %43, %60 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0438799Z %62 = tt.addptr %7, %61 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0439063Z %63 = tt.load %62, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0439322Z %64 = arith.addi %6, %cst_5 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0439629Z %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0439966Z %66 = tt.broadcast %65 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0440175Z %67 = arith.addi %43, %66 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0440387Z %68 = tt.addptr %7, %67 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0440652Z %69 = tt.load %68, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0440987Z %70 = ttg.memdesc_index %20[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0441341Z ttg.local_store %51, %70 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0441696Z %71 = ttg.memdesc_index %21[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0442049Z ttg.local_store %57, %71 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0442396Z %72 = ttg.memdesc_index %22[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> 
!ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0442813Z ttg.local_store %63, %72 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0443157Z %73 = ttg.memdesc_index %23[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0443508Z ttg.local_store %69, %73 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0443728Z %74 = arith.subi %19, %c1_i32 : i32 2026-02-21T08:57:10.0444798Z %75:17 = scf.for %arg3 = %c0_i32 to %74 step %c1_i32 iter_args(%arg4 = %c0_i32, %arg5 = %0, %arg6 = %cst_2, %arg7 = %37, %arg8 = %38, %arg9 = %43, %arg10 = %45, %arg11 = %c0_i32, %arg12 = %c0_i32, %arg13 = %70, %arg14 = %c4_i32, %arg15 = %71, %arg16 = %c8_i32, %arg17 = %72, %arg18 = %c12_i32, %arg19 = %73, %arg20 = %36) -> (i32, i32, tensor<64x128xf32, #mma>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>) : i32 { 2026-02-21T08:57:10.0445833Z %184 = arith.addi %arg12, %c16_i32 : i32 2026-02-21T08:57:10.0445960Z %185 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T08:57:10.0446099Z %186 = arith.cmpi eq, %arg4, %c31_i32 : i32 2026-02-21T08:57:10.0446232Z %187 = arith.select %186, %c0_i32, %185 : i32 2026-02-21T08:57:10.0446372Z %188 = arith.cmpi eq, %187, %c0_i32 : i32 2026-02-21T08:57:10.0446501Z %189 = arith.select %188, %c0_i32, %184 : i32 2026-02-21T08:57:10.0446851Z %190:6 = scf.if %188 -> (tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, 
#blocked1>, tensor<4x128xi32, #blocked2>, i32, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>) { 2026-02-21T08:57:10.0447191Z %334 = arith.addi %arg5, %c2432_i32 : i32 2026-02-21T08:57:10.0447319Z %335 = arith.divsi %334, %c2048_i32 : i32 2026-02-21T08:57:10.0447449Z %336 = arith.muli %335, %c8_i32 : i32 2026-02-21T08:57:10.0447570Z %337 = arith.subi %c64_i32, %336 : i32 2026-02-21T08:57:10.0447695Z %338 = arith.minsi %337, %c8_i32 : i32 2026-02-21T08:57:10.0447816Z %339 = arith.remsi %334, %c2048_i32 : i32 2026-02-21T08:57:10.0447946Z %340 = arith.remsi %339, %338 : i32 2026-02-21T08:57:10.0448069Z %341 = arith.addi %336, %340 : i32 2026-02-21T08:57:10.0448185Z %342 = arith.divsi %339, %338 : i32 2026-02-21T08:57:10.0448363Z %343 = arith.muli %341, %c128_i32 : i32 2026-02-21T08:57:10.0448534Z %344 = tt.splat %343 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0448789Z %345 = tt.splat %343 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:57:10.0449016Z %346 = arith.addi %344, %1 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0449236Z %347 = arith.addi %345, %2 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T08:57:10.0449423Z %348 = arith.muli %342, %c64_i32 : i32 2026-02-21T08:57:10.0449596Z %349 = tt.splat %348 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:57:10.0449828Z %350 = arith.addi %349, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T08:57:10.0450120Z %351 = tt.expand_dims %350 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T08:57:10.0450382Z %352 = arith.muli %351, %cst_7 : tensor<64x1xi32, #blocked1> 2026-02-21T08:57:10.0450583Z %353 = tt.broadcast %352 : tensor<64x1xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0450875Z %354 = tt.expand_dims %347 {axis = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi32, #blocked2> 2026-02-21T08:57:10.0451166Z %355 = tt.broadcast %354 : tensor<1x128xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0451576Z scf.yield %347, %348, %353, %355, %334, %346 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0451955Z } else { 2026-02-21T08:57:10.0452295Z scf.yield %arg7, %arg8, %arg9, %arg10, %arg5, %arg20 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0452659Z } 2026-02-21T08:57:10.0452808Z %191 = tt.splat %arg12 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0453042Z %192 = arith.addi %191, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0453221Z %193 = arith.muli %189, %c2_i32 : i32 2026-02-21T08:57:10.0453392Z %194 = tt.splat %193 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0453614Z %195 = arith.addi %194, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0453893Z %196 = tt.expand_dims %195 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0454176Z %197 = tt.broadcast %196 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0454376Z %198 = arith.addi %190#2, %197 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0454586Z %199 = tt.addptr %7, %198 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0454793Z %200 = tt.load %199 : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0455104Z %201 = ttg.local_load %arg13 : !ttg.memdesc<64x8xbf16, #shared, 
#smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0455556Z %202 = arith.extf %201 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0455946Z %203 = tt.expand_dims %192 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0456212Z %204 = arith.muli %203, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0456451Z %205 = tt.broadcast %204 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0456651Z %206 = arith.addi %205, %arg10 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0456859Z %207 = tt.addptr %8, %206 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0457063Z %208 = tt.load %207 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0457314Z %209 = ttg.convert_layout %208 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0457602Z %210 = arith.shli %209, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0457841Z %211 = arith.shrsi %210, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0458092Z %212 = arith.shrsi %209, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0458391Z %213 = tt.expand_dims %211 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0458738Z %214 = tt.expand_dims %212 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0459028Z %215 = tt.broadcast %213 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0459271Z %216 = arith.select %13, %215, %cst_8 : 
tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0459514Z %217 = tt.broadcast %214 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0459744Z %218 = arith.select %15, %217, %216 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0460011Z %219 = tt.reshape %218 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0460257Z %220 = arith.sitofp %219 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0460514Z %221 = ttg.local_alloc %220 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0460842Z %222 = ttg.local_load %221 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0461316Z %223 = tt.dot %202, %222, %arg6, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0461669Z %224 = arith.addi %189, %c4_i32 : i32 2026-02-21T08:57:10.0461847Z %225 = tt.splat %arg14 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0462074Z %226 = arith.addi %225, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0462247Z %227 = arith.muli %224, %c2_i32 : i32 2026-02-21T08:57:10.0462413Z %228 = tt.splat %227 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0462634Z %229 = arith.addi %228, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0462910Z %230 = tt.expand_dims %229 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0463181Z %231 = tt.broadcast %230 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0463377Z 
%232 = arith.addi %190#2, %231 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0463579Z %233 = tt.addptr %7, %232 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0463786Z %234 = tt.load %233 : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0464086Z %235 = ttg.local_load %arg15 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0464555Z %236 = arith.extf %235 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0464938Z %237 = tt.expand_dims %226 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0465184Z %238 = arith.muli %237, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0465379Z %239 = tt.broadcast %238 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0465580Z %240 = arith.addi %239, %arg10 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0465782Z %241 = tt.addptr %8, %240 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0465991Z %242 = tt.load %241 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0466240Z %243 = ttg.convert_layout %242 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0466528Z %244 = arith.shli %243, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0466765Z %245 = arith.shrsi %244, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0467001Z %246 = arith.shrsi %243, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0467293Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 
2026-02-21T08:57:10.0467635Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0467920Z %249 = tt.broadcast %247 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0468198Z %250 = arith.select %13, %249, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0468438Z %251 = tt.broadcast %248 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0468677Z %252 = arith.select %15, %251, %250 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0468911Z %253 = tt.reshape %252 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0469131Z %254 = arith.sitofp %253 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0469388Z %255 = ttg.local_alloc %254 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0469715Z %256 = ttg.local_load %255 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0470189Z %257 = tt.dot %236, %256, %223, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0470531Z %258 = arith.addi %189, %c8_i32 : i32 2026-02-21T08:57:10.0470702Z %259 = tt.splat %arg16 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0470929Z %260 = arith.addi %259, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0471098Z %261 = arith.muli %258, %c2_i32 : i32 2026-02-21T08:57:10.0471265Z %262 = tt.splat %261 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0471482Z %263 = arith.addi %262, %6 : 
tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0471758Z %264 = tt.expand_dims %263 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0472042Z %265 = tt.broadcast %264 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0472235Z %266 = arith.addi %190#2, %265 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0472469Z %267 = tt.addptr %7, %266 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0472671Z %268 = tt.load %267 : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0472964Z %269 = ttg.local_load %arg17 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0473393Z %270 = arith.extf %269 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0473765Z %271 = tt.expand_dims %260 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0474015Z %272 = arith.muli %271, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0474206Z %273 = tt.broadcast %272 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0474402Z %274 = arith.addi %273, %arg10 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0474605Z %275 = tt.addptr %8, %274 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0474806Z %276 = tt.load %275 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0475054Z %277 = ttg.convert_layout %276 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0475337Z %278 = arith.shli %277, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0475570Z %279 = arith.shrsi 
%278, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0475805Z %280 = arith.shrsi %277, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0476137Z %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0476476Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0476760Z %283 = tt.broadcast %281 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0476996Z %284 = arith.select %13, %283, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0477234Z %285 = tt.broadcast %282 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0477464Z %286 = arith.select %15, %285, %284 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0477698Z %287 = tt.reshape %286 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0477927Z %288 = arith.sitofp %287 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0478182Z %289 = ttg.local_alloc %288 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0478512Z %290 = ttg.local_load %289 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0478977Z %291 = tt.dot %270, %290, %257, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0479322Z %292 = arith.addi %189, %c12_i32 : i32 2026-02-21T08:57:10.0479500Z %293 = tt.splat %arg18 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T08:57:10.0479725Z %294 = arith.addi %293, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0479903Z %295 = arith.muli %292, %c2_i32 : i32 2026-02-21T08:57:10.0480069Z %296 = tt.splat %295 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0489832Z %297 = arith.addi %296, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T08:57:10.0490106Z %298 = tt.expand_dims %297 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T08:57:10.0490377Z %299 = tt.broadcast %298 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0490573Z %300 = arith.addi %190#2, %299 : tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0490824Z %301 = tt.addptr %7, %300 : tensor<64x8x!tt.ptr, #blocked1>, tensor<64x8xi32, #blocked1> 2026-02-21T08:57:10.0491028Z %302 = tt.load %301 : tensor<64x8x!tt.ptr, #blocked1> 2026-02-21T08:57:10.0491332Z %303 = ttg.local_load %arg19 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0491760Z %304 = arith.extf %303 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0492140Z %305 = tt.expand_dims %294 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0492388Z %306 = arith.muli %305, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0492582Z %307 = tt.broadcast %306 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0492775Z %308 = arith.addi %307, %arg10 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0492977Z %309 = tt.addptr %8, %308 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0493177Z %310 = tt.load %309 : 
tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0493463Z %311 = ttg.convert_layout %310 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0493745Z %312 = arith.shli %311, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0493981Z %313 = arith.shrsi %312, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0494216Z %314 = arith.shrsi %311, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0494507Z %315 = tt.expand_dims %313 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0494845Z %316 = tt.expand_dims %314 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0495131Z %317 = tt.broadcast %315 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0495376Z %318 = arith.select %13, %317, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0495615Z %319 = tt.broadcast %316 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0495847Z %320 = arith.select %15, %319, %318 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0496082Z %321 = tt.reshape %320 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0496306Z %322 = arith.sitofp %321 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0496558Z %323 = ttg.local_alloc %322 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0496884Z %324 = ttg.local_load %323 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0497359Z %325 = tt.dot %304, %324, %291, inputPrecision = tf32 : 
tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0497763Z %326 = arith.select %186, %cst_2, %325 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0497907Z scf.if %186 { 2026-02-21T08:57:10.0498050Z %334 = tt.splat %arg8 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:57:10.0498270Z %335 = arith.addi %334, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:57:10.0498484Z %336 = arith.truncf %325 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T08:57:10.0498751Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T08:57:10.0498991Z %338 = arith.muli %337, %cst : tensor<64x1xi32, #mma> 2026-02-21T08:57:10.0499234Z %339 = tt.expand_dims %arg20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T08:57:10.0499501Z %340 = tt.broadcast %338 : tensor<64x1xi32, #mma> -> tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0499709Z %341 = tt.broadcast %339 : tensor<1x128xi32, #mma> -> tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0499897Z %342 = arith.addi %340, %341 : tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0500093Z %343 = tt.addptr %16, %342 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0500292Z tt.store %343, %336 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T08:57:10.0500425Z } 2026-02-21T08:57:10.0500512Z %327 = arith.addi %arg11, %c1_i32 : i32 2026-02-21T08:57:10.0500643Z %328 = arith.cmpi slt, %327, %c1_i32 : i32 2026-02-21T08:57:10.0500772Z %329 = arith.select %328, %327, %c0_i32 : i32 2026-02-21T08:57:10.0501040Z %330 = ttg.memdesc_index %20[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0501435Z ttg.local_store 
%200, %330 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0501792Z %331 = ttg.memdesc_index %21[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0502136Z ttg.local_store %234, %331 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0502478Z %332 = ttg.memdesc_index %22[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0502818Z ttg.local_store %268, %332 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0503159Z %333 = ttg.memdesc_index %23[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0503501Z ttg.local_store %302, %333 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> 2026-02-21T08:57:10.0504380Z scf.yield %187, %190#4, %326, %190#0, %190#1, %190#2, %190#3, %329, %189, %330, %224, %331, %258, %332, %292, %333, %190#5 : i32, i32, tensor<64x128xf32, #mma>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0505137Z } 2026-02-21T08:57:10.0505229Z %76 = arith.cmpi sge, %19, %c1_i32 : i32 2026-02-21T08:57:10.0505401Z %77 = tt.splat %75#8 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0505627Z %78 = arith.addi %77, %5 : tensor<4xi32, 
#ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0505941Z %79 = ttg.local_load %75#9 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0506392Z %80 = arith.extf %79 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0506761Z %81 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0507001Z %82 = arith.muli %81, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0507188Z %83 = tt.broadcast %82 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0507379Z %84 = arith.addi %83, %75#6 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0507574Z %85 = tt.addptr %8, %84 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0507769Z %86 = tt.splat %76 : i1 -> tensor<4x128xi1, #blocked2> 2026-02-21T08:57:10.0507924Z %87 = tt.load %85, %86 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0508085Z %88 = arith.shli %87, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0508243Z %89 = arith.shrsi %88, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0508485Z %90 = ttg.convert_layout %89 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0508727Z %91 = arith.shrsi %87, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0508965Z %92 = ttg.convert_layout %91 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0509296Z %93 = tt.expand_dims %90 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0509669Z %94 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0509947Z %95 = tt.broadcast %93 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0510303Z %96 = arith.select %13, %95, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0510560Z %97 = tt.broadcast %94 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0510827Z %98 = arith.select %15, %97, %96 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0511240Z %99 = tt.reshape %98 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0519256Z %100 = arith.sitofp %99 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0519537Z %101 = ttg.local_alloc %100 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0519879Z %102 = ttg.local_load %101 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0520154Z %103 = scf.if %76 -> (tensor<64x128xf32, #mma>) { 2026-02-21T08:57:10.0520511Z %184 = tt.dot %80, %102, %75#2, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0520870Z scf.yield %184 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0520995Z } else { 2026-02-21T08:57:10.0521097Z scf.yield %75#2 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0521222Z } 2026-02-21T08:57:10.0521364Z %104 = tt.splat %75#10 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0521596Z %105 = arith.addi %104, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0521925Z %106 = ttg.local_load %75#11 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> 
tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0522414Z %107 = arith.extf %106 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0522843Z %108 = tt.expand_dims %105 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0523095Z %109 = arith.muli %108, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0523299Z %110 = tt.broadcast %109 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0523497Z %111 = arith.addi %110, %75#6 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0523711Z %112 = tt.addptr %8, %111 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0523929Z %113 = tt.load %112, %86 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0524103Z %114 = arith.shli %113, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0524277Z %115 = arith.shrsi %114, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0524536Z %116 = ttg.convert_layout %115 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0524792Z %117 = arith.shrsi %113, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0525042Z %118 = ttg.convert_layout %117 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0525378Z %119 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0525721Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0526050Z %121 = tt.broadcast %119 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0526299Z %122 = 
arith.select %13, %121, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0526549Z %123 = tt.broadcast %120 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0526785Z %124 = arith.select %15, %123, %122 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0527020Z %125 = tt.reshape %124 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0527244Z %126 = arith.sitofp %125 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0527501Z %127 = ttg.local_alloc %126 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0527832Z %128 = ttg.local_load %127 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0528099Z %129 = scf.if %76 -> (tensor<64x128xf32, #mma>) { 2026-02-21T08:57:10.0528456Z %184 = tt.dot %107, %128, %103, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0528804Z scf.yield %184 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0528929Z } else { 2026-02-21T08:57:10.0529029Z scf.yield %103 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0529143Z } 2026-02-21T08:57:10.0529288Z %130 = tt.splat %75#12 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0529516Z %131 = arith.addi %130, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0529848Z %132 = ttg.local_load %75#13 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0530278Z %133 = arith.extf %132 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0530715Z %134 = tt.expand_dims %131 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0530970Z %135 = arith.muli %134, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0531164Z %136 = tt.broadcast %135 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0531365Z %137 = arith.addi %136, %75#6 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0531572Z %138 = tt.addptr %8, %137 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0531782Z %139 = tt.load %138, %86 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0531954Z %140 = arith.shli %139, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0532121Z %141 = arith.shrsi %140, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0532380Z %142 = ttg.convert_layout %141 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0532633Z %143 = arith.shrsi %139, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0532881Z %144 = ttg.convert_layout %143 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0533220Z %145 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0533558Z %146 = tt.expand_dims %144 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0533851Z %147 = tt.broadcast %145 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0534095Z %148 = arith.select %13, %147, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0534372Z %149 = tt.broadcast %146 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 
2026-02-21T08:57:10.0534611Z %150 = arith.select %15, %149, %148 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0534845Z %151 = tt.reshape %150 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0535075Z %152 = arith.sitofp %151 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0535330Z %153 = ttg.local_alloc %152 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0535662Z %154 = ttg.local_load %153 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0535927Z %155 = scf.if %76 -> (tensor<64x128xf32, #mma>) { 2026-02-21T08:57:10.0536285Z %184 = tt.dot %133, %154, %129, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0536643Z scf.yield %184 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0536765Z } else { 2026-02-21T08:57:10.0536861Z scf.yield %129 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0537023Z } 2026-02-21T08:57:10.0537176Z %156 = tt.splat %75#14 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0537407Z %157 = arith.addi %156, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T08:57:10.0537736Z %158 = ttg.local_load %75#15 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0538167Z %159 = arith.extf %158 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0538556Z %160 = tt.expand_dims %157 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2> 
2026-02-21T08:57:10.0538845Z %161 = arith.muli %160, %cst_6 : tensor<4x1xi32, #blocked2> 2026-02-21T08:57:10.0539044Z %162 = tt.broadcast %161 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0539239Z %163 = arith.addi %162, %75#6 : tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0539445Z %164 = tt.addptr %8, %163 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi32, #blocked2> 2026-02-21T08:57:10.0539657Z %165 = tt.load %164, %86 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T08:57:10.0539823Z %166 = arith.shli %165, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0539994Z %167 = arith.shrsi %166, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0540244Z %168 = ttg.convert_layout %167 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0540505Z %169 = arith.shrsi %165, %cst_9 : tensor<4x128xi8, #blocked2> 2026-02-21T08:57:10.0540757Z %170 = ttg.convert_layout %169 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T08:57:10.0541091Z %171 = tt.expand_dims %168 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0541429Z %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T08:57:10.0541711Z %173 = tt.broadcast %171 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0541956Z %174 = arith.select %13, %173, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0542202Z %175 = tt.broadcast %172 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0542477Z %176 = arith.select %15, %175, %174 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T08:57:10.0542716Z %177 = tt.reshape %176 : tensor<4x2x128xi8, #blocked> -> 
tensor<8x128xi8, #blocked3> 2026-02-21T08:57:10.0542942Z %178 = arith.sitofp %177 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T08:57:10.0543200Z %179 = ttg.local_alloc %178 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T08:57:10.0543532Z %180 = ttg.local_load %179 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T08:57:10.0543792Z %181 = scf.if %76 -> (tensor<64x128xf32, #mma>) { 2026-02-21T08:57:10.0544153Z %184 = tt.dot %159, %180, %155, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0544499Z scf.yield %184 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0544626Z } else { 2026-02-21T08:57:10.0544727Z scf.yield %155 : tensor<64x128xf32, #mma> 2026-02-21T08:57:10.0544843Z } 2026-02-21T08:57:10.0544939Z %182 = arith.cmpi eq, %75#0, %c31_i32 : i32 2026-02-21T08:57:10.0545060Z %183 = arith.andi %76, %182 : i1 2026-02-21T08:57:10.0545174Z scf.if %183 { 2026-02-21T08:57:10.0545318Z %184 = tt.splat %75#4 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:57:10.0545538Z %185 = arith.addi %184, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T08:57:10.0545755Z %186 = arith.truncf %181 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T08:57:10.0546027Z %187 = tt.expand_dims %185 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T08:57:10.0546268Z %188 = arith.muli %187, %cst : tensor<64x1xi32, #mma> 2026-02-21T08:57:10.0546550Z %189 = ttg.convert_layout %75#3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T08:57:10.0546908Z %190 = tt.expand_dims %189 {axis = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T08:57:10.0547218Z %191 = tt.broadcast %188 : tensor<64x1xi32, #mma> -> tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0547427Z %192 = tt.broadcast %190 : tensor<1x128xi32, #mma> -> tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0547618Z %193 = arith.addi %191, %192 : tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0547813Z %194 = tt.addptr %16, %193 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi32, #mma> 2026-02-21T08:57:10.0548019Z tt.store %194, %186 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T08:57:10.0548154Z } 2026-02-21T08:57:10.0548289Z ttg.local_dealloc %23 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> 2026-02-21T08:57:10.0548506Z ttg.local_dealloc %22 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> 2026-02-21T08:57:10.0548710Z ttg.local_dealloc %21 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> 2026-02-21T08:57:10.0548917Z ttg.local_dealloc %20 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> 2026-02-21T08:57:10.0549074Z tt.return 2026-02-21T08:57:10.0549161Z } 2026-02-21T08:57:10.0549245Z } 2026-02-21T08:57:10.0549297Z 2026-02-21T08:57:10.0549330Z {-# 2026-02-21T08:57:10.0549414Z external_resources: { 2026-02-21T08:57:10.0549521Z mlir_reproducer: { 2026-02-21T08:57:10.0550574Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T08:57:10.0551569Z disable_threading: false, 
2026-02-21T08:57:10.0551677Z verify_each: true 2026-02-21T08:57:10.0551774Z } 2026-02-21T08:57:10.0551847Z } 2026-02-21T08:57:10.0551922Z #-} 2026-02-21T08:57:10.0552199Z /tmp/torchinductor_root/rf/crfyp33noxhmzkqm66tofoowim2no45tvuuxuk6t2jwtqco6f6eg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:57:10.0555660Z /tmp/torchinductor_root/rf/crfyp33noxhmzkqm66tofoowim2no45tvuuxuk6t2jwtqco6f6eg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:57:10.0556252Z [109s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:57:10.0557050Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 64, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T08:57:10.0557770Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:57:10.0557951Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:57:10.0558174Z WARNING:tritonbench.utils.triton_op:Completed input ID 21: 2026-02-21T08:57:10.0558323Z x_val 2026-02-21T08:57:10.0558410Z --------------------- 2026-02-21T08:57:10.0558511Z (4, 4096, 8192, 1024) 2026-02-21T08:57:10.0558570Z 2026-02-21T08:57:10.0558984Z 70%|███████ | 7/10 [47:49<14:02, 280.74s/it]WARNING:tritonbench.utils.triton_op:Running input ID 24: 2026-02-21T08:57:10.0559178Z x_val 2026-02-21T08:57:10.0559264Z ---------------------- 2026-02-21T08:57:10.0559359Z (16, 4096, 1280, 8192) 2026-02-21T08:57:10.0559625Z INFO:tritonbench.utils.triton_op:Took 0.22ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:57:10.0559943Z Initial population exploring neighbors 36% ━━━━━ 36/100 0.5 configs/s 2026-02-21T08:57:10.9811783Z INFO:tritonbench.utils.triton_op:Took 2.46ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:57:14.2895580Z Autotune Choices Stats: 2026-02-21T08:57:14.2897478Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 1.6694060564041138, "best_triton_pos": 1, "best_triton_time": 2.027919054031372, "best_triton_kernel": "triton_mm_185", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4"} 2026-02-21T08:57:14.2902260Z AUTOTUNE mm(65536x8192, 8192x1280) 2026-02-21T08:57:14.2902557Z strides: [8192, 1], [1280, 1] 2026-02-21T08:57:14.2902859Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:57:14.2903136Z mm 1.6694 ms 100.0% 2026-02-21T08:57:14.2903970Z triton_mm_185 2.0279 ms 82.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:57:14.2905383Z triton_mm_191 2.1667 ms 77.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:57:14.2906755Z triton_mm_188 2.2115 ms 75.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:57:14.2908649Z triton_mm_186 2.2370 ms 74.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:57:14.2909762Z triton_mm_182 2.3020 ms 72.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:57:14.2910755Z triton_mm_190 2.3526 ms 71.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:57:14.2911757Z triton_mm_179 2.3692 ms 70.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:57:14.2912769Z triton_mm_183 2.3976 ms 69.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:57:14.2913776Z triton_mm_184 2.5001 ms 66.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=8 2026-02-21T08:57:14.2914543Z SingleProcess AUTOTUNE benchmarking takes 2.8034 seconds and 0.3202 seconds precompiling for 37 choices 2026-02-21T08:57:16.0504219Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:57:16.0513572Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:57:16.0513887Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:57:16.0514158Z 'dtype': 'torch.bfloat16', 2026-02-21T08:57:16.0514440Z 'shape': (16, 4096, 8192), 2026-02-21T08:57:16.0514696Z 'stride': (33554432, 8192, 1)}, 2026-02-21T08:57:16.0515254Z { 'device': 'cuda:0', 2026-02-21T08:57:16.0515504Z 'dtype': 'torch.int32', 2026-02-21T08:57:16.0515751Z 'shape': (8192, 1280), 2026-02-21T08:57:16.0515999Z 'stride': (1280, 1)}), 2026-02-21T08:57:16.0516230Z 'kwargs': {}} 2026-02-21T08:57:16.0527350Z INFO:tritonbench.utils.triton_op:Took 1.54ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:57:16.2279140Z [0s] Autotune random seed: 2134834638 2026-02-21T08:57:16.3658360Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:57:48.4588303Z [32s] Timeout after 30s compiling Config(block_sizes=[32, 8192, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T08:57:54.9420278Z [38s] Timeout after 30s compiling Config(block_sizes=[64, 2048, 2], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], 
range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T08:57:55.2390630Z [38s] Timeout after 30s compiling Config(block_sizes=[256, 128, 4], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:57:58.3724666Z [42s] Timeout after 30s compiling Config(block_sizes=[16, 4096, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:57:58.3746993Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:58:09.1532312Z Initial population exploring neighbors 2% 2/100 0.1 configs/s 2026-02-21T08:58:09.1536833Z WARNING:tritonbench.utils.triton_op:Completed input ID 24: 2026-02-21T08:58:09.1537032Z x_val 2026-02-21T08:58:09.1537171Z ---------------------- 2026-02-21T08:58:09.1537276Z (16, 4096, 1280, 8192) 2026-02-21T08:58:09.1538068Z 2026-02-21T08:58:09.1562217Z 80%|████████ | 8/10 [48:48<07:00, 210.18s/it]WARNING:tritonbench.utils.triton_op:Running input ID 28: 2026-02-21T08:58:09.1562503Z x_val 2026-02-21T08:58:09.1562672Z ---------------------- 2026-02-21T08:58:09.1562806Z (64, 4096, 1280, 8192) 2026-02-21T08:58:09.1602333Z INFO:tritonbench.utils.triton_op:Took 0.45ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:58:10.2205357Z 
INFO:tritonbench.utils.triton_op:Took 2.77ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:58:19.4087613Z Autotune Choices Stats: 2026-02-21T08:58:19.4089300Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 7.480846881866455, "best_triton_pos": 1, "best_triton_time": 7.912878036499023, "best_triton_kernel": "triton_mm_221", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4"} 2026-02-21T08:58:19.4094693Z AUTOTUNE mm(262144x8192, 8192x1280) 2026-02-21T08:58:19.4094925Z strides: [8192, 1], [1280, 1] 2026-02-21T08:58:19.4095164Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:58:19.4095390Z mm 7.4808 ms 100.0% 2026-02-21T08:58:19.4096118Z triton_mm_221 7.9129 ms 94.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:58:19.4097326Z triton_mm_227 8.2071 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:58:19.4098525Z triton_mm_222 8.7857 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:58:19.4099708Z triton_mm_224 8.8305 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:58:19.4100896Z triton_mm_218 9.3142 ms 80.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:58:19.4102079Z triton_mm_226 9.8029 ms 76.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:58:19.4103675Z triton_mm_219 10.5175 ms 71.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:58:19.4104639Z triton_mm_220 10.9395 ms 68.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:58:19.4105579Z triton_mm_215 11.1460 ms 67.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:58:19.4106301Z SingleProcess AUTOTUNE benchmarking takes 8.6244 seconds and 0.3689 seconds precompiling for 37 choices 2026-02-21T08:58:21.6975171Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:58:21.6987961Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:58:21.6988118Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:58:21.6988300Z 'dtype': 'torch.bfloat16', 2026-02-21T08:58:21.6988951Z 'shape': (64, 4096, 8192), 2026-02-21T08:58:21.6989078Z 'stride': (33554432, 8192, 1)}, 2026-02-21T08:58:21.6989195Z { 'device': 'cuda:0', 2026-02-21T08:58:21.6989308Z 'dtype': 'torch.int32', 2026-02-21T08:58:21.6989420Z 'shape': (8192, 1280), 2026-02-21T08:58:21.6989529Z 'stride': (1280, 
1)}), 2026-02-21T08:58:21.6989635Z 'kwargs': {}} 2026-02-21T08:58:21.7005443Z INFO:tritonbench.utils.triton_op:Took 1.96ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:58:21.8825372Z [0s] Autotune random seed: 2134834638 2026-02-21T08:58:22.1632267Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:58:54.3769464Z [32s] Timeout after 30s compiling Config(block_sizes=[32, 8192, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T08:58:59.4880226Z [37s] Timeout after 30s compiling Config(block_sizes=[8, 2048, 4], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T08:59:01.7031087Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 2048, 2], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T08:59:01.9392850Z [39s] Timeout after 30s compiling Config(block_sizes=[256, 128, 4], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], 
load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:59:05.0507072Z [42s] Timeout after 30s compiling Config(block_sizes=[16, 4096, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T08:59:05.0526288Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:59:06.7653505Z Initial population exploring neighbors 1% 1/100 - configs/s 2026-02-21T08:59:06.7658066Z WARNING:tritonbench.utils.triton_op:Completed input ID 28: 2026-02-21T08:59:06.7660426Z x_val 2026-02-21T08:59:06.7660631Z ---------------------- 2026-02-21T08:59:06.7660803Z (64, 4096, 1280, 8192) 2026-02-21T08:59:06.7660967Z 2026-02-21T08:59:06.7679448Z 90%|█████████ | 9/10 [49:45<02:42, 162.49s/it]WARNING:tritonbench.utils.triton_op:Running input ID 31: 2026-02-21T08:59:06.7679683Z x_val 2026-02-21T08:59:06.7679779Z ---------------------- 2026-02-21T08:59:06.7679890Z (64, 4096, 8192, 3584) 2026-02-21T08:59:06.7728090Z INFO:tritonbench.utils.triton_op:Took 0.37ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:59:07.7681621Z INFO:tritonbench.utils.triton_op:Took 3.05ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:59:30.7765039Z Autotune Choices Stats: 2026-02-21T08:59:30.7766960Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 20.915571212768555, 
"best_triton_pos": 1, "best_triton_time": 23.032167434692383, "best_triton_kernel": "triton_mm_263", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"} 2026-02-21T08:59:30.7772538Z AUTOTUNE mm(262144x3584, 3584x8192) 2026-02-21T08:59:30.7772735Z strides: [3584, 1], [8192, 1] 2026-02-21T08:59:30.7773741Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:59:30.7774035Z mm 20.9156 ms 100.0% 2026-02-21T08:59:30.7774432Z triton_mm_263 23.0322 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:59:30.7775152Z triton_mm_260 24.1291 ms 86.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:59:30.7775765Z triton_mm_257 25.0306 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:59:30.7776996Z triton_mm_262 25.7462 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:59:30.7777589Z triton_mm_254 26.9698 ms 77.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T08:59:30.7778191Z triton_mm_251 27.2256 ms 76.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T08:59:30.7778783Z triton_mm_258 28.3907 ms 73.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:59:30.7779392Z triton_mm_261 29.7136 ms 70.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:59:30.7779990Z triton_mm_256 29.8412 ms 70.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T08:59:30.7780450Z SingleProcess AUTOTUNE benchmarking takes 22.4539 seconds and 0.3177 seconds precompiling for 37 choices 2026-02-21T08:59:32.9099278Z INFO:tritonbench.utils.triton_op:Took 0.21ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:59:32.9113799Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:59:32.9114823Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:59:32.9115062Z 'dtype': 'torch.bfloat16', 2026-02-21T08:59:32.9115194Z 'shape': (64, 4096, 3584), 2026-02-21T08:59:32.9115316Z 'stride': (14680064, 3584, 1)}, 2026-02-21T08:59:32.9115437Z { 'device': 'cuda:0', 2026-02-21T08:59:32.9115600Z 'dtype': 'torch.int32', 2026-02-21T08:59:32.9115714Z 'shape': (3584, 8192), 2026-02-21T08:59:32.9116338Z 'stride': (8192, 1)}), 2026-02-21T08:59:32.9116442Z 'kwargs': {}} 2026-02-21T08:59:32.9130806Z INFO:tritonbench.utils.triton_op:Took 1.89ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:59:33.1007419Z [0s] Autotune random seed: 2134834638 2026-02-21T08:59:33.7407183Z [0s] 
Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:00:06.0429158Z [32s] Timeout after 30s compiling Config(block_sizes=[32, 8192, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T09:00:12.7103906Z [38s] Timeout after 30s compiling Config(block_sizes=[64, 2048, 2], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:00:12.9324544Z [39s] Timeout after 30s compiling Config(block_sizes=[256, 128, 4], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:00:12.9342644Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T09:00:17.8615556Z Initial population exploring neighbors 1% 1/100 - configs/s 2026-02-21T09:00:17.8618944Z WARNING:tritonbench.utils.triton_op:Completed input ID 31: 2026-02-21T09:00:17.8619259Z x_val 2026-02-21T09:00:17.8619382Z ---------------------- 2026-02-21T09:00:17.8619571Z (64, 4096, 8192, 3584) 2026-02-21T09:00:17.8619677Z 
2026-02-21T09:00:17.8620237Z 100%|██████████| 10/10 [50:57<00:00, 134.27s/it] 2026-02-21T09:00:17.8620521Z 100%|██████████| 10/10 [50:57<00:00, 305.71s/it] 2026-02-21T09:00:17.8632242Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpxao986lk.csv 2026-02-21T09:00:17.8644949Z x_val preprocessed_torch_compile_int4_gemm-speedup preprocessed_torch_compile_int4_gemm-accuracy preprocessed_triton_int4_gemm-speedup preprocessed_triton_int4_gemm-accuracy helion_int4_gemm_tritonbench-speedup helion_int4_gemm_tritonbench-accuracy 2026-02-21T09:00:17.8647488Z ---------------------- ---------------------------------------------- ----------------------------------------------- ---------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ --------------------------------------- 2026-02-21T09:00:17.8650083Z (1, 1, 1280, 8192) 18.543 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. 
10.202746020611514 1 2026-02-21T09:00:17.8651152Z ERROR:__main__:failed to process results 2026-02-21T09:00:17.8651371Z Traceback (most recent call last): 2026-02-21T09:00:17.8657986Z File "/__w/helion/helion/benchmarks/run.py", line 1312, in run_kernel_variants 2026-02-21T09:00:17.8658254Z process_result( 2026-02-21T09:00:17.8658476Z File "/__w/helion/helion/benchmarks/run.py", line 1380, in process_result 2026-02-21T09:00:17.8658732Z metrics[active_metrics[kernel_name][name]].append(float(item)) 2026-02-21T09:00:17.8658941Z ^^^^^^^^^^^ 2026-02-21T09:00:17.8659353Z ValueError: could not convert string to float: 'Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.' 2026-02-21T09:00:17.8660597Z (1, 1, 8192, 3584) 13.5655 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. 11.663682014903282 1 2026-02-21T09:00:17.8661980Z (4, 1, 8192, 3584) 4.89291 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. 6.8418023230135 1 2026-02-21T09:00:17.8663359Z (16, 1, 7168, 8192) 9.25921 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. 10.242236859684546 1 2026-02-21T09:00:17.8664789Z (64, 1, 7168, 8192) 7.82407 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. 7.149499113923205 1 2026-02-21T09:00:17.8666171Z (1, 4096, 8192, 1024) 1.55308 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. Error from Triton code: 2026-02-21T09:00:17.8667449Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:00:17.8667894Z 2026-02-21T09:00:17.8668285Z Error running generated Triton program: 2026-02-21T09:00:17.8669159Z SystemError: returned a result with an exception set 2026-02-21T09:00:17.8670375Z @helion.kernel(config=helion.Config(block_sizes=[512, 1, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:00:17.8671600Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 2026-02-21T09:00:17.8672629Z (4, 4096, 8192, 1024) 1.15597 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. Error from Triton code: 2026-02-21T09:00:17.8673588Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:00:17.8674037Z 2026-02-21T09:00:17.8674432Z Error running generated Triton program: 2026-02-21T09:00:17.8675305Z SystemError: returned a result with an exception set 2026-02-21T09:00:17.8676521Z @helion.kernel(config=helion.Config(block_sizes=[256, 4, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:00:17.8677694Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 
2026-02-21T09:00:17.8678717Z (16, 4096, 1280, 8192) 1.05667 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. Error from Triton code: 2026-02-21T09:00:17.8679657Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:00:17.8680089Z 2026-02-21T09:00:17.8680474Z Error running generated Triton program: 2026-02-21T09:00:17.8681324Z SystemError: returned a result with an exception set 2026-02-21T09:00:17.8682521Z @helion.kernel(config=helion.Config(block_sizes=[128, 1, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:00:17.8683798Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 2026-02-21T09:00:17.8684767Z (64, 4096, 1280, 8192) 1.0136 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. Error from Triton code: 2026-02-21T09:00:17.8685710Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:00:17.8686151Z 2026-02-21T09:00:17.8686532Z Error running generated Triton program: 2026-02-21T09:00:17.8687445Z SystemError: returned a result with an exception set 2026-02-21T09:00:17.8688651Z @helion.kernel(config=helion.Config(block_sizes=[4, 32, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:00:17.8689812Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 2026-02-21T09:00:17.8690785Z (64, 4096, 8192, 3584) 1.01338 1 Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help. Error from Triton code: 2026-02-21T09:00:17.8691736Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:00:17.8692196Z 2026-02-21T09:00:17.8692581Z Error running generated Triton program: 2026-02-21T09:00:17.8693427Z SystemError: returned a result with an exception set 2026-02-21T09:00:17.8694628Z @helion.kernel(config=helion.Config(block_sizes=[4, 32, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:00:17.8695782Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 
2026-02-21T09:00:20.7083453Z average 5.98773 1 4.609996633213604 0.5 2026-02-21T09:02:38.5936741Z Applying custom args for int4_gemm: {'num_inputs': 10} 2026-02-21T09:02:38.6022615Z Running int4_gemm benchmark with Helion implementation... 2026-02-21T09:02:38.6022883Z 2026-02-21T09:02:38.7622259Z Equally-spaced-k mode: Selected 10 equally spaced inputs (total available: 32) 2026-02-21T09:02:38.7622885Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 3, 7, 10, 14, 17, 21, 24, 28, 31] 2026-02-21T09:02:38.7629195Z 2026-02-21T09:02:38.7635185Z 0%| | 0/10 [00:00> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:09:09.8709181Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:09:09.8709326Z v_2 = b_tile << v_1 2026-02-21T09:09:09.8709456Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:09:09.8709593Z v_4 = v_2 >> v_3 2026-02-21T09:09:09.8709831Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:09:09.8710046Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:09:09.8710185Z v_6 = b_tile >> v_5 2026-02-21T09:09:09.8710358Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:09:09.8710554Z stack_idx = tl.arange(0, 2) 2026-02-21T09:09:09.8710708Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:09:09.8710876Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:09:09.8711044Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:09:09.8711205Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:09:09.8711367Z mask_0 = broadcast_idx == 0 2026-02-21T09:09:09.8711549Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:09:09.8711739Z mask_1 = broadcast_idx == 1 2026-02-21T09:09:09.8711919Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:09:09.8712149Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:09:09.8712384Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:09:09.8712605Z # 
src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:09:09.8712795Z view = tl.reshape(stacked_result, [_SHAPE_DIM_3, _BLOCK_SIZE_2]) 2026-02-21T09:09:09.8712962Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:09:09.8713145Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:09:09.8713323Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:09:09.8713468Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:09:09.8713617Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:09:09.8713808Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:09:09.8714002Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:09:09.8714129Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:09:09.8714262Z acc = acc_copy_0 + sum_1 2026-02-21T09:09:09.8714406Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:09:09.8714597Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:09:09.8714782Z tl.store(C + tl.broadcast_to(indices_2[None, :] * 1, [_BLOCK_SIZE_1, _BLOCK_SIZE_2]), v_10, None) 2026-02-21T09:09:09.8714938Z 2026-02-21T09:09:09.8715025Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:09:09.8715187Z """ 2026-02-21T09:09:09.8715298Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:09:09.8715401Z 2026-02-21T09:09:09.8715465Z This kernel performs matrix multiplication where: 2026-02-21T09:09:09.8715612Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:09:09.8715780Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:09:09.8715947Z (two 4-bit values packed into each int8) 2026-02-21T09:09:09.8716037Z 2026-02-21T09:09:09.8716072Z Args: 2026-02-21T09:09:09.8716193Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:09:09.8716377Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 
2026-02-21T09:09:09.8716493Z 2026-02-21T09:09:09.8716528Z Returns: 2026-02-21T09:09:09.8716644Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 2026-02-21T09:09:09.8716785Z """ 2026-02-21T09:09:09.8716892Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:09:09.8717010Z M, K = A.shape 2026-02-21T09:09:09.8717111Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:09:09.8717222Z _, N = B.shape 2026-02-21T09:09:09.8717371Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:09:09.8717574Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:09:09.8717755Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:09:09.8717939Z _BLOCK_SIZE_2 = 8 2026-02-21T09:09:09.8718108Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:09:09.8718379Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:09:09.8718631Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:09:09.8718813Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:09:09.8718924Z _BLOCK_SIZE_0 = 1024 2026-02-21T09:09:09.8719086Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:09:09.8719264Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:09:09.8719406Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:09:09.8719589Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:09:09.8719756Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:09:09.8719889Z _SHAPE_DIM_3 = 2 * _BLOCK_SIZE_0 2026-02-21T09:09:09.8720029Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:09:09.8720225Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:09:09.8720392Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T09:09:09.8720532Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:09:09.8720878Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(1280, _BLOCK_SIZE_2) * 1,), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, _SHAPE_DIM_3, num_warps=8, num_stages=1, waves_per_eu=1, matrix_instr_nonkdim=16) 2026-02-21T09:09:09.8721200Z # src[int4_gemm.py:93]: return C 2026-02-21T09:09:09.8721313Z return C 2026-02-21T09:09:10.6688424Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T09:09:10.6688684Z x_val 2026-02-21T09:09:10.6688822Z ------------------ 2026-02-21T09:09:10.6688965Z (1, 1, 1280, 8192) 2026-02-21T09:09:10.6689057Z 2026-02-21T09:09:10.6703881Z 10%|█ | 1/10 [06:31<58:47, 391.91s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3: 2026-02-21T09:09:10.6704437Z x_val 2026-02-21T09:09:10.6704564Z ------------------ 2026-02-21T09:09:10.6704696Z (1, 1, 8192, 3584) 2026-02-21T09:09:10.6706725Z INFO:tritonbench.utils.triton_op:Took 0.21ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:09:11.8554674Z INFO:tritonbench.utils.triton_op:Took 4.61ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:09:14.0902513Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T09:09:14.0933825Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:09:14.0934103Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:09:14.0934329Z 'dtype': 'torch.bfloat16', 2026-02-21T09:09:14.0934553Z 'shape': (1, 1, 3584), 2026-02-21T09:09:14.0934756Z 'stride': (3584, 3584, 1)}, 2026-02-21T09:09:14.0935014Z { 'device': 'cuda:0', 2026-02-21T09:09:14.0935211Z 'dtype': 'torch.int32', 2026-02-21T09:09:14.0935410Z 'shape': (3584, 8192), 2026-02-21T09:09:14.0935615Z 'stride': (8192, 1)}), 2026-02-21T09:09:14.0935805Z 'kwargs': {}} 2026-02-21T09:09:14.0954463Z INFO:tritonbench.utils.triton_op:Took 2.27ms to get benchmark function for 
helion_int4_gemm_tritonbench 2026-02-21T09:09:14.2851289Z [0s] Autotune random seed: 2138032649 2026-02-21T09:09:14.3061363Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:09:49.4575707Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 1], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:09:51.1427018Z [36s] Timeout after 30s compiling Config(block_sizes=[512, 1, 1024], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:09:52.7224775Z [38s] Timeout after 30s compiling Config(block_sizes=[256, 1, 512], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:09:53.4099709Z [39s] Timeout after 30s compiling Config(block_sizes=[128, 1, 1024], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', 
range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T09:09:53.4117217Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T09:10:00.7250001Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.9 configs/s 2026-02-21T09:10:00.7258308Z [46s] Adaptive compile timeout: 30s (90% percentile=10.8s, bounds=[30.0s, 30s]) 2026-02-21T09:10:01.6816005Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 992.0 configs/s 2026-02-21T09:10:01.9553771Z [47s] Initial random population of 100, 5 starting points: 2026-02-21T09:10:01.9553999Z timeout=4 2026-02-21T09:10:01.9554125Z ok=96 2026-02-21T09:10:01.9554213Z min=0.0358 2026-02-21T09:10:01.9554321Z mid=0.2003 2026-02-21T09:10:01.9555111Z max=111.4510 2026-02-21T09:10:01.9555263Z best={'block_sizes': [128, 1, 16], 2026-02-21T09:10:01.9555478Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T09:10:01.9555664Z 'l2_groupings': [1], 2026-02-21T09:10:01.9555770Z 'load_eviction_policies': ['', ''], 2026-02-21T09:10:01.9555933Z 'loop_orders': [[1, 0]], 2026-02-21T09:10:01.9556060Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:10:01.9556215Z 'num_stages': 3, 2026-02-21T09:10:01.9556342Z 'num_warps': 16, 2026-02-21T09:10:01.9556441Z 'pid_type': 'flat', 2026-02-21T09:10:01.9556597Z 'range_flattens': [None, False], 2026-02-21T09:10:01.9556783Z 'range_multi_buffers': [None, False], 2026-02-21T09:10:01.9556945Z 'range_num_stages': [0, 4], 2026-02-21T09:10:01.9557081Z 'range_unroll_factors': [0, 1], 2026-02-21T09:10:01.9557218Z 'range_warp_specializes': [], 2026-02-21T09:10:01.9557387Z 'waves_per_eu': 3} 2026-02-21T09:10:01.9723198Z [47s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:10:02.8412429Z [48s] Generation 1 starting: 85 neighbors, 5 active search path(s) 2026-02-21T09:10:37.9396204Z [83s] Timeout after 30s compiling 
Config(block_sizes=[512, 1, 8], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[4, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:10:43.9639659Z [89s] Timeout after 30s compiling Config(block_sizes=[128, 1, 128], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:10:43.9655554Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 0.5 configs/s 2026-02-21T09:10:49.3645393Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 16.4 configs/s 2026-02-21T09:10:54.5585336Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 190.4 2026-02-21T09:10:54.5585767Z configs/s 2026-02-21T09:10:55.0704429Z [100s] Generation 1 complete: 2026-02-21T09:10:55.0704639Z timeout=2 2026-02-21T09:10:55.0704751Z ok=88 2026-02-21T09:10:55.0704839Z min=0.0351 2026-02-21T09:10:55.0704921Z mid=0.0392 2026-02-21T09:10:55.0705005Z max=0.4275 2026-02-21T09:10:55.0705096Z best={'block_sizes': [128, 1, 16], 2026-02-21T09:10:55.0705237Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:10:55.0705369Z 'l2_groupings': [1], 2026-02-21T09:10:55.0705519Z 'load_eviction_policies': ['', ''], 2026-02-21T09:10:55.0705639Z 'loop_orders': [[1, 0]], 2026-02-21T09:10:55.0705741Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:10:55.0706231Z 'num_stages': 3, 2026-02-21T09:10:55.0706316Z 'num_warps': 16, 2026-02-21T09:10:55.0706404Z 'pid_type': 'flat', 
2026-02-21T09:10:55.0706501Z 'range_flattens': [None, False], 2026-02-21T09:10:55.0706622Z 'range_multi_buffers': [None, False], 2026-02-21T09:10:55.0706738Z 'range_num_stages': [0, 4], 2026-02-21T09:10:55.0706845Z 'range_unroll_factors': [0, 1], 2026-02-21T09:10:55.0706954Z 'range_warp_specializes': [], 2026-02-21T09:10:55.0707060Z 'waves_per_eu': 3} 2026-02-21T09:10:55.1812718Z [100s] Fitting surrogate: 190 points, 190 targets 2026-02-21T09:10:56.5712820Z [102s] Generation 2 starting: 77 neighbors, 5 active search path(s) 2026-02-21T09:11:10.0446815Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 1.8 configs/s 2026-02-21T09:11:14.9957939Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 16.0 configs/s 2026-02-21T09:11:20.7201098Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 185.0 2026-02-21T09:11:20.7201500Z configs/s 2026-02-21T09:11:21.2576765Z [126s] Generation 2 complete: 2026-02-21T09:11:21.2577035Z ok=82 2026-02-21T09:11:21.2577156Z min=0.0322 2026-02-21T09:11:21.2577283Z mid=0.0369 2026-02-21T09:11:21.2577412Z max=0.1388 2026-02-21T09:11:21.2577545Z best={'block_sizes': [256, 1, 16], 2026-02-21T09:11:21.2577746Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T09:11:21.2577981Z 'l2_groupings': [32], 2026-02-21T09:11:21.2578107Z 'load_eviction_policies': ['', ''], 2026-02-21T09:11:21.2578232Z 'loop_orders': [[0, 1]], 2026-02-21T09:11:21.2578345Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:11:21.2578452Z 'num_stages': 1, 2026-02-21T09:11:21.2578546Z 'num_warps': 4, 2026-02-21T09:11:21.2578642Z 'pid_type': 'xyz', 2026-02-21T09:11:21.2579318Z 'range_flattens': [None, True], 2026-02-21T09:11:21.2579436Z 'range_multi_buffers': [None, None], 2026-02-21T09:11:21.2579557Z 'range_num_stages': [0, 1], 2026-02-21T09:11:21.2579684Z 'range_unroll_factors': [0, 4], 2026-02-21T09:11:21.2579801Z 'range_warp_specializes': [], 2026-02-21T09:11:21.2579905Z 'waves_per_eu': 2} 2026-02-21T09:11:21.3821831Z 
[127s] Fitting surrogate: 272 points, 272 targets 2026-02-21T09:11:22.3566519Z [128s] Generation 3 starting: 76 neighbors, 5 active search path(s) 2026-02-21T09:11:54.8460389Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 0.1 configs/s 2026-02-21T09:11:59.8831114Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 15.9 configs/s 2026-02-21T09:12:05.3760701Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 193.1 2026-02-21T09:12:05.3760967Z configs/s 2026-02-21T09:12:05.9858946Z [171s] Generation 3 complete: 2026-02-21T09:12:05.9859302Z ok=81 2026-02-21T09:12:05.9859542Z min=0.0316 2026-02-21T09:12:05.9859749Z mid=0.0368 2026-02-21T09:12:05.9859934Z max=0.3577 2026-02-21T09:12:05.9860185Z best={'block_sizes': [128, 1, 32], 2026-02-21T09:12:05.9860535Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:12:05.9860885Z 'l2_groupings': [32], 2026-02-21T09:12:05.9861157Z 'load_eviction_policies': ['', ''], 2026-02-21T09:12:05.9861449Z 'loop_orders': [[1, 0]], 2026-02-21T09:12:05.9861722Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:12:05.9861980Z 'num_stages': 4, 2026-02-21T09:12:05.9862205Z 'num_warps': 16, 2026-02-21T09:12:05.9862420Z 'pid_type': 'xyz', 2026-02-21T09:12:05.9862667Z 'range_flattens': [None, False], 2026-02-21T09:12:05.9862957Z 'range_multi_buffers': [None, None], 2026-02-21T09:12:05.9863259Z 'range_num_stages': [0, 3], 2026-02-21T09:12:05.9863527Z 'range_unroll_factors': [0, 0], 2026-02-21T09:12:05.9863815Z 'range_warp_specializes': [], 2026-02-21T09:12:05.9864085Z 'waves_per_eu': 2} 2026-02-21T09:12:06.1523082Z [171s] Fitting surrogate: 353 points, 353 targets 2026-02-21T09:12:06.9503847Z [172s] Generation 4 starting: 75 neighbors, 5 active search path(s) 2026-02-21T09:12:33.1259175Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.2 configs/s 2026-02-21T09:12:38.0933133Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 15.7 configs/s 
2026-02-21T09:12:43.6594190Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 192.1 2026-02-21T09:12:43.6594826Z configs/s 2026-02-21T09:12:44.2292308Z [209s] Generation 4 complete: 2026-02-21T09:12:44.2292679Z ok=81 2026-02-21T09:12:44.2292924Z min=0.0276 2026-02-21T09:12:44.2293156Z mid=0.0339 2026-02-21T09:12:44.2293358Z max=0.3608 2026-02-21T09:12:44.2293591Z best={'block_sizes': [256, 1, 32], 2026-02-21T09:12:44.2293956Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:12:44.2294321Z 'l2_groupings': [32], 2026-02-21T09:12:44.2294639Z 'load_eviction_policies': ['', ''], 2026-02-21T09:12:44.2294950Z 'loop_orders': [[1, 0]], 2026-02-21T09:12:44.2295234Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:12:44.2295533Z 'num_stages': 4, 2026-02-21T09:12:44.2295777Z 'num_warps': 4, 2026-02-21T09:12:44.2296010Z 'pid_type': 'xyz', 2026-02-21T09:12:44.2296267Z 'range_flattens': [None, False], 2026-02-21T09:12:44.2296577Z 'range_multi_buffers': [None, None], 2026-02-21T09:12:44.2296899Z 'range_num_stages': [0, 3], 2026-02-21T09:12:44.2297181Z 'range_unroll_factors': [0, 0], 2026-02-21T09:12:44.2297491Z 'range_warp_specializes': [], 2026-02-21T09:12:44.2297776Z 'waves_per_eu': 2} 2026-02-21T09:12:44.3959665Z [210s] Fitting surrogate: 434 points, 434 targets 2026-02-21T09:12:45.2570411Z [210s] Generation 5 starting: 72 neighbors, 5 active search path(s) 2026-02-21T09:13:23.0140716Z [248s] Timeout after 30s compiling Config(block_sizes=[256, 1, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:13:23.0157762Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.2 configs/s 2026-02-21T09:13:27.5328345Z Generation 5: exploring 
neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 16.1 configs/s 2026-02-21T09:13:31.8574304Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 229.9 2026-02-21T09:13:31.8574691Z configs/s 2026-02-21T09:13:32.3665090Z [258s] Generation 5 complete: 2026-02-21T09:13:32.3665481Z timeout=1 2026-02-21T09:13:32.3665686Z ok=76 2026-02-21T09:13:32.3665894Z min=0.0259 2026-02-21T09:13:32.3666100Z mid=0.0320 2026-02-21T09:13:32.3666297Z max=0.3357 2026-02-21T09:13:32.3666524Z best={'block_sizes': [256, 1, 32], 2026-02-21T09:13:32.3666932Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:13:32.3667298Z 'l2_groupings': [2], 2026-02-21T09:13:32.3667605Z 'load_eviction_policies': ['', ''], 2026-02-21T09:13:32.3667921Z 'loop_orders': [[1, 0]], 2026-02-21T09:13:32.3668202Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:13:32.3668473Z 'num_stages': 3, 2026-02-21T09:13:32.3668699Z 'num_warps': 8, 2026-02-21T09:13:32.3668932Z 'pid_type': 'xyz', 2026-02-21T09:13:32.3669188Z 'range_flattens': [None, False], 2026-02-21T09:13:32.3669498Z 'range_multi_buffers': [None, False], 2026-02-21T09:13:32.3669812Z 'range_num_stages': [0, 2], 2026-02-21T09:13:32.3670093Z 'range_unroll_factors': [0, 1], 2026-02-21T09:13:32.3670392Z 'range_warp_specializes': [], 2026-02-21T09:13:32.3670666Z 'waves_per_eu': 2} 2026-02-21T09:13:32.4912121Z [258s] Fitting surrogate: 511 points, 511 targets 2026-02-21T09:13:33.1593940Z [258s] Generation 6 starting: 58 neighbors, 4 active search path(s) 2026-02-21T09:13:55.7952510Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.4 configs/s 2026-02-21T09:13:59.6566097Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 15.9 configs/s 2026-02-21T09:14:03.3483864Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.1 2026-02-21T09:14:03.3484135Z configs/s 2026-02-21T09:14:03.7996160Z [289s] Generation 6 complete: 2026-02-21T09:14:03.7996568Z ok=63 2026-02-21T09:14:03.7996781Z min=0.0258 
2026-02-21T09:14:03.7996997Z mid=0.0318 2026-02-21T09:14:03.7997196Z max=0.3579 2026-02-21T09:14:03.7997428Z best={'block_sizes': [256, 1, 32], 2026-02-21T09:14:03.7997793Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:14:03.7998161Z 'l2_groupings': [32], 2026-02-21T09:14:03.7998448Z 'load_eviction_policies': ['', ''], 2026-02-21T09:14:03.7998760Z 'loop_orders': [[1, 0]], 2026-02-21T09:14:03.7999038Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:14:03.7999311Z 'num_stages': 4, 2026-02-21T09:14:03.7999580Z 'num_warps': 8, 2026-02-21T09:14:03.7999811Z 'pid_type': 'xyz', 2026-02-21T09:14:03.8000067Z 'range_flattens': [None, False], 2026-02-21T09:14:03.8000408Z 'range_multi_buffers': [None, False], 2026-02-21T09:14:03.8000723Z 'range_num_stages': [0, 2], 2026-02-21T09:14:03.8001004Z 'range_unroll_factors': [0, 1], 2026-02-21T09:14:03.8001300Z 'range_warp_specializes': [], 2026-02-21T09:14:03.8001575Z 'waves_per_eu': 2} 2026-02-21T09:14:03.9071742Z [289s] Fitting surrogate: 574 points, 574 targets 2026-02-21T09:14:04.4392546Z [290s] Generation 7 starting: 43 neighbors, 3 active search path(s) 2026-02-21T09:14:38.9837460Z [324s] Timeout after 30s compiling Config(block_sizes=[512, 1, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:14:38.9857975Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 0.3 configs/s 2026-02-21T09:14:41.8458909Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 16.0 configs/s 2026-02-21T09:14:43.7996203Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 496.8 2026-02-21T09:14:43.7996473Z configs/s 2026-02-21T09:14:44.0880878Z [329s] Generation 7 complete: 
2026-02-21T09:14:44.0881058Z timeout=1 2026-02-21T09:14:44.0881142Z ok=45 2026-02-21T09:14:44.0881223Z min=0.0257 2026-02-21T09:14:44.0881304Z mid=0.0313 2026-02-21T09:14:44.0881381Z max=0.2706 2026-02-21T09:14:44.0881468Z best={'block_sizes': [256, 1, 32], 2026-02-21T09:14:44.0881608Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:14:44.0881741Z 'l2_groupings': [32], 2026-02-21T09:14:44.0881847Z 'load_eviction_policies': ['', ''], 2026-02-21T09:14:44.0881981Z 'loop_orders': [[1, 0]], 2026-02-21T09:14:44.0882090Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:14:44.0882192Z 'num_stages': 4, 2026-02-21T09:14:44.0882299Z 'num_warps': 8, 2026-02-21T09:14:44.0882388Z 'pid_type': 'xyz', 2026-02-21T09:14:44.0882481Z 'range_flattens': [None, False], 2026-02-21T09:14:44.0882705Z 'range_multi_buffers': [None, False], 2026-02-21T09:14:44.0882819Z 'range_num_stages': [0, 2], 2026-02-21T09:14:44.0882924Z 'range_unroll_factors': [0, 1], 2026-02-21T09:14:44.0883032Z 'range_warp_specializes': [], 2026-02-21T09:14:44.0883136Z 'waves_per_eu': 2} 2026-02-21T09:14:44.1390430Z [329s] Fitting surrogate: 620 points, 620 targets 2026-02-21T09:14:44.6744650Z [330s] Generation 8 starting: 41 neighbors, 3 active search path(s) 2026-02-21T09:15:02.5677929Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 0.4 configs/s 2026-02-21T09:15:05.3206913Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 16.1 configs/s 2026-02-21T09:15:07.1090122Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 538.1 2026-02-21T09:15:07.1091494Z configs/s 2026-02-21T09:15:07.3954703Z [353s] Generation 8 complete: 2026-02-21T09:15:07.3955064Z ok=44 2026-02-21T09:15:07.3955279Z min=0.0257 2026-02-21T09:15:07.3955486Z mid=0.0307 2026-02-21T09:15:07.3955685Z max=0.6465 2026-02-21T09:15:07.3955915Z best={'block_sizes': [256, 1, 32], 2026-02-21T09:15:07.3956287Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:15:07.3956652Z 'l2_groupings': [32], 
2026-02-21T09:15:07.3956929Z 'load_eviction_policies': ['', ''], 2026-02-21T09:15:07.3957244Z 'loop_orders': [[1, 0]], 2026-02-21T09:15:07.3957525Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:15:07.3957792Z 'num_stages': 4, 2026-02-21T09:15:07.3958024Z 'num_warps': 8, 2026-02-21T09:15:07.3958260Z 'pid_type': 'xyz', 2026-02-21T09:15:07.3958513Z 'range_flattens': [None, False], 2026-02-21T09:15:07.3958843Z 'range_multi_buffers': [None, False], 2026-02-21T09:15:07.3959153Z 'range_num_stages': [0, 2], 2026-02-21T09:15:07.3959436Z 'range_unroll_factors': [0, 1], 2026-02-21T09:15:07.3959752Z 'range_warp_specializes': [], 2026-02-21T09:15:07.3960030Z 'waves_per_eu': 2} 2026-02-21T09:15:07.4401018Z [353s] Fitting surrogate: 664 points, 664 targets 2026-02-21T09:15:08.4259654Z [354s] Generation 9 starting: 30 neighbors, 2 active search path(s) 2026-02-21T09:15:42.6144290Z [388s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 128], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:15:42.6163065Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0.2 configs/s 2026-02-21T09:15:44.4929624Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.9 configs/s 2026-02-21T09:15:45.5864837Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 852.1 2026-02-21T09:15:45.5865516Z configs/s 2026-02-21T09:15:45.8590960Z [391s] Generation 9 complete: 2026-02-21T09:15:45.8591242Z timeout=1 2026-02-21T09:15:45.8591358Z ok=32 2026-02-21T09:15:45.8591475Z min=0.0247 2026-02-21T09:15:45.8591591Z mid=0.0352 2026-02-21T09:15:45.8591711Z max=0.4406 2026-02-21T09:15:45.8591845Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:15:45.8592072Z 
'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:15:45.8592289Z 'l2_groupings': [2], 2026-02-21T09:15:45.8592455Z 'load_eviction_policies': ['', ''], 2026-02-21T09:15:45.8592673Z 'loop_orders': [[1, 0]], 2026-02-21T09:15:45.8592842Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:15:45.8593000Z 'num_stages': 4, 2026-02-21T09:15:45.8593142Z 'num_warps': 8, 2026-02-21T09:15:45.8593324Z 'pid_type': 'flat', 2026-02-21T09:15:45.8593474Z 'range_flattens': [None, False], 2026-02-21T09:15:45.8593677Z 'range_multi_buffers': [None, False], 2026-02-21T09:15:45.8593860Z 'range_num_stages': [0, 2], 2026-02-21T09:15:45.8594031Z 'range_unroll_factors': [0, 1], 2026-02-21T09:15:45.8594206Z 'range_warp_specializes': [], 2026-02-21T09:15:45.8594370Z 'waves_per_eu': 2} 2026-02-21T09:15:45.8882396Z [391s] Fitting surrogate: 697 points, 697 targets 2026-02-21T09:15:46.2022448Z [391s] Generation 10 starting: 24 neighbors, 2 active search path(s) 2026-02-21T09:16:05.6904651Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 0.7 configs/s 2026-02-21T09:16:07.3401289Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 16.6 configs/s 2026-02-21T09:16:08.1670678Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1111.4 2026-02-21T09:16:08.1671301Z configs/s 2026-02-21T09:16:08.3998352Z [414s] Generation 10 complete: 2026-02-21T09:16:08.3998686Z ok=26 2026-02-21T09:16:08.3998901Z min=0.0248 2026-02-21T09:16:08.4000189Z mid=0.0438 2026-02-21T09:16:08.4000395Z max=0.6669 2026-02-21T09:16:08.4000628Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:16:08.4000998Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:16:08.4001364Z 'l2_groupings': [2], 2026-02-21T09:16:08.4001637Z 'load_eviction_policies': ['', ''], 2026-02-21T09:16:08.4001944Z 'loop_orders': [[1, 0]], 2026-02-21T09:16:08.4002219Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:16:08.4002490Z 'num_stages': 4, 2026-02-21T09:16:08.4002805Z 'num_warps': 8, 
2026-02-21T09:16:08.4003039Z 'pid_type': 'flat', 2026-02-21T09:16:08.4003300Z 'range_flattens': [None, False], 2026-02-21T09:16:08.4003632Z 'range_multi_buffers': [None, False], 2026-02-21T09:16:08.4003947Z 'range_num_stages': [0, 2], 2026-02-21T09:16:08.4004222Z 'range_unroll_factors': [0, 1], 2026-02-21T09:16:08.4004759Z 'range_warp_specializes': [], 2026-02-21T09:16:08.4004979Z 'waves_per_eu': 2} 2026-02-21T09:16:08.4173532Z [414s] Fitting surrogate: 723 points, 723 targets 2026-02-21T09:16:08.6530104Z [414s] Generation 11 starting: 14 neighbors, 1 active search path(s) 2026-02-21T09:16:17.9666449Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 1.0 configs/s 2026-02-21T09:16:18.9692723Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.5 configs/s 2026-02-21T09:16:19.1735455Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3920.2 2026-02-21T09:16:19.1735689Z configs/s 2026-02-21T09:16:19.3750137Z [425s] Generation 11 complete: 2026-02-21T09:16:19.3750395Z ok=16 2026-02-21T09:16:19.3750565Z min=0.0248 2026-02-21T09:16:19.3750732Z mid=0.0826 2026-02-21T09:16:19.3750881Z max=0.7023 2026-02-21T09:16:19.3751041Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:16:19.3751347Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:16:19.3751603Z 'l2_groupings': [2], 2026-02-21T09:16:19.3751807Z 'load_eviction_policies': ['', ''], 2026-02-21T09:16:19.3752063Z 'loop_orders': [[1, 0]], 2026-02-21T09:16:19.3752262Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:16:19.3752455Z 'num_stages': 4, 2026-02-21T09:16:19.3752626Z 'num_warps': 8, 2026-02-21T09:16:19.3752787Z 'pid_type': 'flat', 2026-02-21T09:16:19.3752977Z 'range_flattens': [None, False], 2026-02-21T09:16:19.3753200Z 'range_multi_buffers': [None, False], 2026-02-21T09:16:19.3753422Z 'range_num_stages': [0, 2], 2026-02-21T09:16:19.3753625Z 'range_unroll_factors': [0, 1], 2026-02-21T09:16:19.3753836Z 'range_warp_specializes': [], 2026-02-21T09:16:19.3754035Z 
'waves_per_eu': 2} 2026-02-21T09:16:19.3821861Z [425s] Fitting surrogate: 739 points, 739 targets 2026-02-21T09:16:19.6135057Z [425s] Generation 12 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:16:31.6249252Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.6 configs/s 2026-02-21T09:16:32.6901517Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.4 configs/s 2026-02-21T09:16:32.8333508Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 6458.5 2026-02-21T09:16:32.8333748Z configs/s 2026-02-21T09:16:33.0058626Z [438s] Generation 12 complete: 2026-02-21T09:16:33.0059108Z ok=17 2026-02-21T09:16:33.0059322Z min=0.0246 2026-02-21T09:16:33.0059548Z mid=0.0963 2026-02-21T09:16:33.0059745Z max=0.3697 2026-02-21T09:16:33.0059977Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:16:33.0060388Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:16:33.0060762Z 'l2_groupings': [2], 2026-02-21T09:16:33.0061038Z 'load_eviction_policies': ['', ''], 2026-02-21T09:16:33.0061348Z 'loop_orders': [[1, 0]], 2026-02-21T09:16:33.0061626Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:16:33.0061904Z 'num_stages': 4, 2026-02-21T09:16:33.0062133Z 'num_warps': 8, 2026-02-21T09:16:33.0062427Z 'pid_type': 'flat', 2026-02-21T09:16:33.0077990Z 'range_flattens': [None, False], 2026-02-21T09:16:33.0078211Z 'range_multi_buffers': [None, False], 2026-02-21T09:16:33.0078393Z 'range_num_stages': [0, 2], 2026-02-21T09:16:33.0078521Z 'range_unroll_factors': [0, 1], 2026-02-21T09:16:33.0078634Z 'range_warp_specializes': [], 2026-02-21T09:16:33.0078738Z 'waves_per_eu': 2} 2026-02-21T09:16:33.0120985Z [438s] Fitting surrogate: 756 points, 756 targets 2026-02-21T09:16:33.6667972Z [439s] Generation 13 starting: 13 neighbors, 1 active search path(s) 2026-02-21T09:16:44.4435120Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 0.8 configs/s 2026-02-21T09:16:45.3885438Z Generation 13: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━ 14/14 17.5 configs/s 2026-02-21T09:16:45.4630749Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━ 1000/1000 - configs/s 2026-02-21T09:16:45.5549050Z [451s] Generation 13 complete: 2026-02-21T09:16:45.5549300Z ok=15 2026-02-21T09:16:45.5550044Z min=0.0245 2026-02-21T09:16:45.5550126Z mid=0.0961 2026-02-21T09:16:45.5550200Z max=0.7028 2026-02-21T09:16:45.5550307Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:16:45.5550446Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:16:45.5550585Z 'l2_groupings': [2], 2026-02-21T09:16:45.5550693Z 'load_eviction_policies': ['', ''], 2026-02-21T09:16:45.5550809Z 'loop_orders': [[1, 0]], 2026-02-21T09:16:45.5550917Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:16:45.5551019Z 'num_stages': 4, 2026-02-21T09:16:45.5551105Z 'num_warps': 8, 2026-02-21T09:16:45.5551192Z 'pid_type': 'flat', 2026-02-21T09:16:45.5551292Z 'range_flattens': [None, False], 2026-02-21T09:16:45.5551406Z 'range_multi_buffers': [None, False], 2026-02-21T09:16:45.5551525Z 'range_num_stages': [0, 2], 2026-02-21T09:16:45.5551629Z 'range_unroll_factors': [0, 1], 2026-02-21T09:16:45.5551740Z 'range_warp_specializes': [], 2026-02-21T09:16:45.5551847Z 'waves_per_eu': 2} 2026-02-21T09:16:45.5606402Z [451s] Fitting surrogate: 771 points, 771 targets 2026-02-21T09:16:45.8413726Z [451s] Generation 14 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:16:51.9184662Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 2.7 configs/s 2026-02-21T09:16:53.0543201Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s 2026-02-21T09:16:53.4215618Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2337.7 2026-02-21T09:16:53.4216973Z configs/s 2026-02-21T09:16:53.6426080Z [459s] Generation 14 complete: 2026-02-21T09:16:53.6426434Z ok=18 2026-02-21T09:16:53.6426652Z min=0.0244 2026-02-21T09:16:53.6426864Z mid=0.1417 2026-02-21T09:16:53.6427075Z max=0.5240 
2026-02-21T09:16:53.6427308Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:16:53.6427681Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:16:53.6428044Z 'l2_groupings': [2], 2026-02-21T09:16:53.6428372Z 'load_eviction_policies': ['', ''], 2026-02-21T09:16:53.6428686Z 'loop_orders': [[1, 0]], 2026-02-21T09:16:53.6428962Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:16:53.6429724Z 'num_stages': 4, 2026-02-21T09:16:53.6429954Z 'num_warps': 8, 2026-02-21T09:16:53.6430189Z 'pid_type': 'flat', 2026-02-21T09:16:53.6430455Z 'range_flattens': [None, False], 2026-02-21T09:16:53.6430769Z 'range_multi_buffers': [None, False], 2026-02-21T09:16:53.6431081Z 'range_num_stages': [0, 2], 2026-02-21T09:16:53.6431332Z 'range_unroll_factors': [0, 1], 2026-02-21T09:16:53.6431506Z 'range_warp_specializes': [], 2026-02-21T09:16:53.6431625Z 'waves_per_eu': 2} 2026-02-21T09:16:53.6527561Z [459s] Fitting surrogate: 789 points, 789 targets 2026-02-21T09:16:53.9038389Z [459s] Generation 15 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:17:07.3077413Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.7 configs/s 2026-02-21T09:17:08.3927487Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.0 configs/s 2026-02-21T09:17:08.6133927Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4230.6 2026-02-21T09:17:08.6134538Z configs/s 2026-02-21T09:17:08.8129443Z [474s] Generation 15 complete: 2026-02-21T09:17:08.8129733Z ok=17 2026-02-21T09:17:08.8129912Z min=0.0245 2026-02-21T09:17:08.8130088Z mid=0.0855 2026-02-21T09:17:08.8130256Z max=0.8225 2026-02-21T09:17:08.8130477Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:17:08.8130787Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:17:08.8131096Z 'l2_groupings': [2], 2026-02-21T09:17:08.8131328Z 'load_eviction_policies': ['', ''], 2026-02-21T09:17:08.8131584Z 'loop_orders': [[1, 0]], 2026-02-21T09:17:08.8131811Z 'matrix_instr_nonkdim': 32, 
2026-02-21T09:17:08.8132039Z 'num_stages': 4, 2026-02-21T09:17:08.8132222Z 'num_warps': 8, 2026-02-21T09:17:08.8132416Z 'pid_type': 'flat', 2026-02-21T09:17:08.8132630Z 'range_flattens': [None, False], 2026-02-21T09:17:08.8133338Z 'range_multi_buffers': [None, False], 2026-02-21T09:17:08.8133596Z 'range_num_stages': [0, 2], 2026-02-21T09:17:08.8133844Z 'range_unroll_factors': [0, 1], 2026-02-21T09:17:08.8134095Z 'range_warp_specializes': [], 2026-02-21T09:17:08.8134318Z 'waves_per_eu': 2} 2026-02-21T09:17:08.8196949Z [474s] Fitting surrogate: 806 points, 806 targets 2026-02-21T09:17:09.0388991Z [474s] Generation 16 starting: 13 neighbors, 1 active search path(s) 2026-02-21T09:17:11.2750677Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 7.0 configs/s 2026-02-21T09:17:12.2687932Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 16.4 configs/s 2026-02-21T09:17:12.9808796Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1256.6 2026-02-21T09:17:12.9809421Z configs/s 2026-02-21T09:17:13.1968732Z [478s] Generation 16 complete: 2026-02-21T09:17:13.1969141Z ok=15 2026-02-21T09:17:13.1969412Z min=0.0249 2026-02-21T09:17:13.1969622Z mid=0.0344 2026-02-21T09:17:13.1969823Z max=0.0958 2026-02-21T09:17:13.1970084Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:17:13.1970454Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:17:13.1970810Z 'l2_groupings': [2], 2026-02-21T09:17:13.1971083Z 'load_eviction_policies': ['', ''], 2026-02-21T09:17:13.1971397Z 'loop_orders': [[1, 0]], 2026-02-21T09:17:13.1971672Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:17:13.1971943Z 'num_stages': 4, 2026-02-21T09:17:13.1972168Z 'num_warps': 8, 2026-02-21T09:17:13.1972398Z 'pid_type': 'flat', 2026-02-21T09:17:13.1972655Z 'range_flattens': [None, False], 2026-02-21T09:17:13.1972962Z 'range_multi_buffers': [None, False], 2026-02-21T09:17:13.1973268Z 'range_num_stages': [0, 2], 2026-02-21T09:17:13.1973546Z 'range_unroll_factors': [0, 
1], 2026-02-21T09:17:13.1973842Z 'range_warp_specializes': [], 2026-02-21T09:17:13.1974115Z 'waves_per_eu': 2} 2026-02-21T09:17:13.2093224Z [478s] Fitting surrogate: 821 points, 821 targets 2026-02-21T09:17:13.4386165Z [479s] Generation 17 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:17:45.2432912Z [510s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:17:45.2454482Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.3 configs/s 2026-02-21T09:17:46.1656661Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 18.3 configs/s 2026-02-21T09:17:46.3825184Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4026.2 2026-02-21T09:17:46.3829479Z configs/s 2026-02-21T09:17:46.5799871Z [512s] Generation 17 complete: 2026-02-21T09:17:46.5800120Z timeout=1 2026-02-21T09:17:46.5800223Z ok=16 2026-02-21T09:17:46.5800314Z min=0.0247 2026-02-21T09:17:46.5800395Z mid=0.0858 2026-02-21T09:17:46.5800487Z max=0.8225 2026-02-21T09:17:46.5800575Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:17:46.5800724Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:17:46.5800865Z 'l2_groupings': [2], 2026-02-21T09:17:46.5800975Z 'load_eviction_policies': ['', ''], 2026-02-21T09:17:46.5801096Z 'loop_orders': [[1, 0]], 2026-02-21T09:17:46.5801212Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:17:46.5801321Z 'num_stages': 4, 2026-02-21T09:17:46.5801411Z 'num_warps': 8, 2026-02-21T09:17:46.5801506Z 'pid_type': 'flat', 2026-02-21T09:17:46.5801608Z 'range_flattens': [None, False], 2026-02-21T09:17:46.5801732Z 'range_multi_buffers': [None, False], 
2026-02-21T09:17:46.5801851Z 'range_num_stages': [0, 2], 2026-02-21T09:17:46.5801963Z 'range_unroll_factors': [0, 1], 2026-02-21T09:17:46.5802790Z 'range_warp_specializes': [], 2026-02-21T09:17:46.5802977Z 'waves_per_eu': 2} 2026-02-21T09:17:46.5864861Z [512s] Fitting surrogate: 838 points, 838 targets 2026-02-21T09:17:46.8124259Z [512s] Generation 18 starting: 12 neighbors, 1 active search path(s) 2026-02-21T09:17:48.5671757Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 12.0 configs/s 2026-02-21T09:17:49.4264881Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 15.9 configs/s 2026-02-21T09:17:49.8052736Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2211.7 2026-02-21T09:17:49.8052967Z configs/s 2026-02-21T09:17:50.0034912Z [515s] Generation 18 complete: 2026-02-21T09:17:50.0035137Z ok=14 2026-02-21T09:17:50.0035268Z min=0.0247 2026-02-21T09:17:50.0035372Z mid=0.0376 2026-02-21T09:17:50.0035466Z max=0.0976 2026-02-21T09:17:50.0035579Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:17:50.0035810Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:17:50.0035974Z 'l2_groupings': [2], 2026-02-21T09:17:50.0036680Z 'load_eviction_policies': ['', ''], 2026-02-21T09:17:50.0036821Z 'loop_orders': [[1, 0]], 2026-02-21T09:17:50.0036941Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:17:50.0037064Z 'num_stages': 4, 2026-02-21T09:17:50.0037173Z 'num_warps': 8, 2026-02-21T09:17:50.0037277Z 'pid_type': 'flat', 2026-02-21T09:17:50.0037402Z 'range_flattens': [None, False], 2026-02-21T09:17:50.0037542Z 'range_multi_buffers': [None, False], 2026-02-21T09:17:50.0037677Z 'range_num_stages': [0, 2], 2026-02-21T09:17:50.0037803Z 'range_unroll_factors': [0, 1], 2026-02-21T09:17:50.0037936Z 'range_warp_specializes': [], 2026-02-21T09:17:50.0038063Z 'waves_per_eu': 2} 2026-02-21T09:17:50.0135076Z [515s] Fitting surrogate: 852 points, 852 targets 2026-02-21T09:17:50.6759629Z [516s] Generation 19 starting: 13 neighbors, 1 
active search path(s) 2026-02-21T09:17:55.3140195Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2.0 configs/s 2026-02-21T09:17:56.2805116Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.0 configs/s 2026-02-21T09:17:56.5926298Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2674.6 2026-02-21T09:17:56.5926529Z configs/s 2026-02-21T09:17:56.7790452Z [522s] Generation 19 complete: 2026-02-21T09:17:56.7790674Z ok=15 2026-02-21T09:17:56.7790766Z min=0.0247 2026-02-21T09:17:56.7790854Z mid=0.0396 2026-02-21T09:17:56.7790939Z max=0.4143 2026-02-21T09:17:56.7791029Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:17:56.7791179Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:17:56.7791318Z 'l2_groupings': [2], 2026-02-21T09:17:56.7791435Z 'load_eviction_policies': ['', ''], 2026-02-21T09:17:56.7791556Z 'loop_orders': [[1, 0]], 2026-02-21T09:17:56.7791671Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:17:56.7791777Z 'num_stages': 4, 2026-02-21T09:17:56.7792473Z 'num_warps': 8, 2026-02-21T09:17:56.7792570Z 'pid_type': 'flat', 2026-02-21T09:17:56.7792691Z 'range_flattens': [None, False], 2026-02-21T09:17:56.7792816Z 'range_multi_buffers': [None, False], 2026-02-21T09:17:56.7792934Z 'range_num_stages': [0, 2], 2026-02-21T09:17:56.7793048Z 'range_unroll_factors': [0, 1], 2026-02-21T09:17:56.7793161Z 'range_warp_specializes': [], 2026-02-21T09:17:56.7793272Z 'waves_per_eu': 2} 2026-02-21T09:17:56.7864208Z [522s] Fitting surrogate: 867 points, 867 targets 2026-02-21T09:17:57.0347063Z [522s] Generation 20 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:18:01.6267658Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.2 configs/s 2026-02-21T09:18:02.7355850Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.5 configs/s 2026-02-21T09:18:03.4305113Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1288.7 2026-02-21T09:18:03.4305695Z configs/s 
2026-02-21T09:18:03.6656571Z [529s] Generation 20 complete: 2026-02-21T09:18:03.6656953Z ok=17 2026-02-21T09:18:03.6657179Z min=0.0248 2026-02-21T09:18:03.6657397Z mid=0.0351 2026-02-21T09:18:03.6657600Z max=0.1104 2026-02-21T09:18:03.6657835Z best={'block_sizes': [512, 1, 32], 2026-02-21T09:18:03.6658212Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:18:03.6658583Z 'l2_groupings': [2], 2026-02-21T09:18:03.6658862Z 'load_eviction_policies': ['', ''], 2026-02-21T09:18:03.6659180Z 'loop_orders': [[1, 0]], 2026-02-21T09:18:03.6659472Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:18:03.6659747Z 'num_stages': 4, 2026-02-21T09:18:03.6659980Z 'num_warps': 8, 2026-02-21T09:18:03.6660213Z 'pid_type': 'flat', 2026-02-21T09:18:03.6660476Z 'range_flattens': [None, False], 2026-02-21T09:18:03.6660785Z 'range_multi_buffers': [None, False], 2026-02-21T09:18:03.6661098Z 'range_num_stages': [0, 2], 2026-02-21T09:18:03.6661385Z 'range_unroll_factors': [0, 1], 2026-02-21T09:18:03.6661704Z 'range_warp_specializes': [], 2026-02-21T09:18:03.6661988Z 'waves_per_eu': 2} 2026-02-21T09:18:03.6803947Z [529s] Fitting surrogate: 884 points, 884 targets 2026-02-21T09:18:03.8024768Z [529s] Autotuning complete in 529.5s after searching 832 configs. 
2026-02-21T09:18:03.8025181Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:18:03.8026565Z @helion.kernel(config=helion.Config(block_sizes=[512, 1, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:18:03.8027832Z 2026-02-21T09:18:03.8028176Z [529s] Code of selected kernel: /tmp/torchinductor_root/xy/cxyczichkyjbyzbbk57upsfj6bffbu55pjaku6spkqal2zheky63.py 2026-02-21T09:18:03.8202700Z from __future__ import annotations 2026-02-21T09:18:03.8202860Z 2026-02-21T09:18:03.8202928Z import torch 2026-02-21T09:18:03.8203086Z import triton 2026-02-21T09:18:03.8203255Z import triton.language as tl 2026-02-21T09:18:03.8203520Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:18:03.8203737Z 2026-02-21T09:18:03.8203825Z _BLOCK_SIZE_2 = tl.constexpr(32) 2026-02-21T09:18:03.8204032Z _BLOCK_SIZE_1 = tl.constexpr(1) 2026-02-21T09:18:03.8204234Z _BLOCK_SIZE_0 = tl.constexpr(512) 2026-02-21T09:18:03.8204365Z 2026-02-21T09:18:03.8204427Z @triton.jit 2026-02-21T09:18:03.8204760Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr, _SHAPE_DIM_3: tl.constexpr): 2026-02-21T09:18:03.8205212Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:18:03.8205504Z num_pid_m = tl.cdiv(8192, _BLOCK_SIZE_2) 2026-02-21T09:18:03.8205705Z num_pid_n = 1 2026-02-21T09:18:03.8206215Z inner_2d_pid = tl.program_id(0) 2026-02-21T09:18:03.8206426Z num_pid_in_group = 2 * num_pid_n 2026-02-21T09:18:03.8206649Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:18:03.8206876Z first_pid_m = group_id * 2 2026-02-21T09:18:03.8207095Z group_size_m = min(num_pid_m - 
first_pid_m, 2) 2026-02-21T09:18:03.8207393Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:18:03.8207666Z offset_2 = pid_0 * _BLOCK_SIZE_2 2026-02-21T09:18:03.8207923Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:18:03.8208274Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:18:03.8208616Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:18:03.8208995Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:18:03.8209461Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:18:03.8209912Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:18:03.8210225Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:18:03.8210640Z for offset_3 in tl.range(0, 1792, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=2, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T09:18:03.8211124Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:18:03.8211392Z mask_0 = indices_3 < 1792 2026-02-21T09:18:03.8211580Z acc_copy = acc 2026-02-21T09:18:03.8211753Z acc_copy_0 = acc_copy 2026-02-21T09:18:03.8212000Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:18:03.8212247Z mul = 2 * offset_3 2026-02-21T09:18:03.8212534Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:18:03.8212852Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:18:03.8213174Z load = tl.broadcast_to(tl.load(A + iota[None, :] * 1, None), [_BLOCK_SIZE_1, _SHAPE_DIM_2]) 2026-02-21T09:18:03.8213686Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:18:03.8213975Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:18:03.8214181Z # src[int4_gemm.py:67]: ) # 
[BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:18:03.8214379Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:18:03.8214625Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:18:03.8214965Z b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), mask_0[:, None], other=0) 2026-02-21T09:18:03.8215304Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:18:03.8215544Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:18:03.8215701Z v_2 = b_tile << v_1 2026-02-21T09:18:03.8215850Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:18:03.8215994Z v_4 = v_2 >> v_3 2026-02-21T09:18:03.8216208Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:18:03.8216438Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:18:03.8216583Z v_6 = b_tile >> v_5 2026-02-21T09:18:03.8216769Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:18:03.8216981Z stack_idx = tl.arange(0, 2) 2026-02-21T09:18:03.8217148Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:18:03.8217321Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:18:03.8217496Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:18:03.8217665Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:18:03.8217844Z mask_1 = broadcast_idx == 0 2026-02-21T09:18:03.8218044Z stacked_result = tl.where(mask_1, expanded_0, stacked_result) 2026-02-21T09:18:03.8218302Z mask_2 = broadcast_idx == 1 2026-02-21T09:18:03.8218497Z stacked_result = tl.where(mask_2, expanded_1, stacked_result) 2026-02-21T09:18:03.8218740Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:18:03.8218990Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:18:03.8219216Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:18:03.8219439Z view = tl.reshape(stacked_result, [_SHAPE_DIM_3, _BLOCK_SIZE_2]) 
2026-02-21T09:18:03.8219653Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:18:03.8219874Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:18:03.8220121Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:18:03.8220309Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:18:03.8220514Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:18:03.8220771Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:18:03.8221029Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:18:03.8221200Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:18:03.8221371Z acc = acc_copy_0 + sum_1 2026-02-21T09:18:03.8221601Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:18:03.8221805Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:18:03.8222137Z tl.store(tl.make_block_ptr(C, [1, 8192], [8192, 1], [0, offset_2], [1, _BLOCK_SIZE_2], [1, 0]), tl.reshape(v_10, [1, _BLOCK_SIZE_2]), boundary_check=[1]) 2026-02-21T09:18:03.8222421Z 2026-02-21T09:18:03.8222536Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:18:03.8222749Z """ 2026-02-21T09:18:03.8222905Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:18:03.8223043Z 2026-02-21T09:18:03.8223132Z This kernel performs matrix multiplication where: 2026-02-21T09:18:03.8223338Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:18:03.8223552Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:18:03.8223827Z (two 4-bit values packed into each int8) 2026-02-21T09:18:03.8223947Z 2026-02-21T09:18:03.8223996Z Args: 2026-02-21T09:18:03.8224172Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:18:03.8224408Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 
2026-02-21T09:18:03.8224533Z 2026-02-21T09:18:03.8224571Z Returns: 2026-02-21T09:18:03.8224699Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 2026-02-21T09:18:03.8224846Z """ 2026-02-21T09:18:03.8224945Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:18:03.8225069Z M, K = A.shape 2026-02-21T09:18:03.8225172Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:18:03.8225297Z _, N = B.shape 2026-02-21T09:18:03.8225453Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:18:03.8225675Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:18:03.8225866Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:18:03.8226023Z _BLOCK_SIZE_2 = 32 2026-02-21T09:18:03.8226199Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:18:03.8226483Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:18:03.8226756Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:18:03.8226944Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:18:03.8227064Z _BLOCK_SIZE_0 = 512 2026-02-21T09:18:03.8227234Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:18:03.8227426Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:18:03.8227614Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:18:03.8227808Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:18:03.8227994Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:18:03.8228125Z _SHAPE_DIM_3 = 2 * _BLOCK_SIZE_0 2026-02-21T09:18:03.8228279Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:18:03.8228514Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:18:03.8228696Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T09:18:03.8228842Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:18:03.8229220Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(8192, _BLOCK_SIZE_2) * 1,), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, _SHAPE_DIM_3, num_warps=8, num_stages=4, waves_per_eu=2, matrix_instr_nonkdim=32) 2026-02-21T09:18:03.8229566Z # src[int4_gemm.py:93]: return C 2026-02-21T09:18:03.8229681Z return C 2026-02-21T09:18:04.6690181Z WARNING:tritonbench.utils.triton_op:Completed input ID 3: 2026-02-21T09:18:04.6690459Z x_val 2026-02-21T09:18:04.6690572Z ------------------ 2026-02-21T09:18:04.6690716Z (1, 1, 8192, 3584) 2026-02-21T09:18:04.6690782Z 2026-02-21T09:18:04.6698751Z 20%|██ | 2/10 [15:25<1:03:23, 475.49s/it]WARNING:tritonbench.utils.triton_op:Running input ID 7: 2026-02-21T09:18:04.6699025Z x_val 2026-02-21T09:18:04.6699135Z ------------------ 2026-02-21T09:18:04.6699255Z (4, 1, 8192, 3584) 2026-02-21T09:18:04.6704057Z INFO:tritonbench.utils.triton_op:Took 0.43ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:18:05.8439032Z INFO:tritonbench.utils.triton_op:Took 4.81ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:18:07.7728431Z INFO:tritonbench.utils.triton_op:Took 0.18ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T09:18:07.7847587Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:18:07.7847905Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:18:07.7848123Z 'dtype': 'torch.bfloat16', 2026-02-21T09:18:07.7848316Z 'shape': (4, 1, 3584), 2026-02-21T09:18:07.7849047Z 'stride': (3584, 3584, 1)}, 2026-02-21T09:18:07.7849172Z { 'device': 'cuda:0', 2026-02-21T09:18:07.7849290Z 'dtype': 'torch.int32', 2026-02-21T09:18:07.7849410Z 'shape': (3584, 8192), 2026-02-21T09:18:07.7849520Z 'stride': (8192, 1)}), 2026-02-21T09:18:07.7849666Z 'kwargs': {}} 2026-02-21T09:18:07.7882151Z INFO:tritonbench.utils.triton_op:Took 3.72ms to get benchmark function for 
helion_int4_gemm_tritonbench 2026-02-21T09:18:08.0113463Z [0s] Autotune random seed: 2138032649 2026-02-21T09:18:08.0385374Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:18:41.5978666Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 1, 16], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:18:42.5755890Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 1, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T09:18:42.9950703Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T09:18:43.2622888Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', 
range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 1], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:18:43.7770059Z [35s] Timeout after 30s compiling Config(block_sizes=[8, 1, 2048], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:18:49.5235774Z [41s] Timeout after 30s compiling Config(block_sizes=[128, 1, 256], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T09:18:50.1277516Z [42s] Timeout after 30s compiling Config(block_sizes=[512, 1, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[4, 1], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:18:50.1290358Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T09:18:56.0853413Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.9 configs/s 2026-02-21T09:18:56.0863189Z [48s] Adaptive compile timeout: 30s (90% percentile=10.9s, bounds=[30.0s, 30s]) 2026-02-21T09:18:56.2517242Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 
5458.9 configs/s 2026-02-21T09:18:56.5238593Z [48s] Initial random population of 100, 5 starting points: 2026-02-21T09:18:56.5239109Z timeout=7 2026-02-21T09:18:56.5239328Z ok=93 2026-02-21T09:18:56.5239539Z min=0.0455 2026-02-21T09:18:56.5239754Z mid=0.2931 2026-02-21T09:18:56.5239965Z max=21.1505 2026-02-21T09:18:56.5240213Z best={'block_sizes': [128, 2, 16], 2026-02-21T09:18:56.5240577Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T09:18:56.5240949Z 'l2_groupings': [2], 2026-02-21T09:18:56.5241300Z 'load_eviction_policies': ['', ''], 2026-02-21T09:18:56.5241620Z 'loop_orders': [[0, 1]], 2026-02-21T09:18:56.5241897Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:18:56.5242201Z 'num_stages': 2, 2026-02-21T09:18:56.5242442Z 'num_warps': 1, 2026-02-21T09:18:56.5242779Z 'pid_type': 'xyz', 2026-02-21T09:18:56.5243044Z 'range_flattens': [None, None], 2026-02-21T09:18:56.5243362Z 'range_multi_buffers': [None, False], 2026-02-21T09:18:56.5243682Z 'range_num_stages': [0, 3], 2026-02-21T09:18:56.5243972Z 'range_unroll_factors': [0, 1], 2026-02-21T09:18:56.5244277Z 'range_warp_specializes': [], 2026-02-21T09:18:56.5244561Z 'waves_per_eu': 1} 2026-02-21T09:18:56.5277234Z [48s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:18:57.8168880Z [49s] Generation 1 starting: 90 neighbors, 5 active search path(s) 2026-02-21T09:19:29.4778370Z [81s] Timeout after 30s compiling Config(block_sizes=[512, 2, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:19:30.3199521Z [82s] Timeout after 30s compiling Config(block_sizes=[512, 2, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], 
load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:19:30.3221152Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 0.7 configs/s 2026-02-21T09:19:36.0033249Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 16.7 configs/s 2026-02-21T09:19:37.9571444Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 479.9 2026-02-21T09:19:37.9571736Z configs/s 2026-02-21T09:19:38.3038593Z [90s] Generation 1 complete: 2026-02-21T09:19:38.3038963Z timeout=2 2026-02-21T09:19:38.3039172Z ok=94 2026-02-21T09:19:38.3039388Z min=0.0399 2026-02-21T09:19:38.3039597Z mid=0.0730 2026-02-21T09:19:38.3039802Z max=0.5802 2026-02-21T09:19:38.3040030Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:19:38.3040412Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:19:38.3040782Z 'l2_groupings': [32], 2026-02-21T09:19:38.3041065Z 'load_eviction_policies': ['', ''], 2026-02-21T09:19:38.3041380Z 'loop_orders': [[1, 0]], 2026-02-21T09:19:38.3041662Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:19:38.3041935Z 'num_stages': 3, 2026-02-21T09:19:38.3042171Z 'num_warps': 1, 2026-02-21T09:19:38.3042410Z 'pid_type': 'xyz', 2026-02-21T09:19:38.3042741Z 'range_flattens': [None, None], 2026-02-21T09:19:38.3043096Z 'range_multi_buffers': [None, None], 2026-02-21T09:19:38.3043409Z 'range_num_stages': [0, 2], 2026-02-21T09:19:38.3043520Z 'range_unroll_factors': [0, 1], 2026-02-21T09:19:38.3044015Z 'range_warp_specializes': [], 2026-02-21T09:19:38.3044129Z 'waves_per_eu': 1} 2026-02-21T09:19:38.3379682Z [90s] Fitting surrogate: 196 points, 196 targets 2026-02-21T09:19:39.2627573Z [91s] Generation 2 starting: 93 neighbors, 5 active search path(s) 2026-02-21T09:20:12.8403599Z [124s] Timeout after 30s compiling 
Config(block_sizes=[128, 2, 32], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:20:13.0041845Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 0.6 configs/s 2026-02-21T09:20:18.8929806Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 16.7 configs/s 2026-02-21T09:20:24.0346967Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 190.8 2026-02-21T09:20:24.0347417Z configs/s 2026-02-21T09:20:24.5604149Z [136s] Generation 2 complete: 2026-02-21T09:20:24.5604580Z timeout=1 2026-02-21T09:20:24.5604791Z ok=98 2026-02-21T09:20:24.5605006Z min=0.0390 2026-02-21T09:20:24.5605215Z mid=0.0567 2026-02-21T09:20:24.5605425Z max=0.7450 2026-02-21T09:20:24.5605659Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:20:24.5606065Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:20:24.5606463Z 'l2_groupings': [32], 2026-02-21T09:20:24.5606747Z 'load_eviction_policies': ['', ''], 2026-02-21T09:20:24.5607068Z 'loop_orders': [[1, 0]], 2026-02-21T09:20:24.5607352Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:20:24.5607631Z 'num_stages': 3, 2026-02-21T09:20:24.5608607Z 'num_warps': 2, 2026-02-21T09:20:24.5608848Z 'pid_type': 'flat', 2026-02-21T09:20:24.5609089Z 'range_flattens': [None, None], 2026-02-21T09:20:24.5609360Z 'range_multi_buffers': [None, None], 2026-02-21T09:20:24.5609609Z 'range_num_stages': [0, 2], 2026-02-21T09:20:24.5609836Z 'range_unroll_factors': [0, 1], 2026-02-21T09:20:24.5610073Z 'range_warp_specializes': [], 2026-02-21T09:20:24.5610306Z 'waves_per_eu': 1} 2026-02-21T09:20:24.6803951Z [136s] Fitting surrogate: 295 points, 295 targets 
2026-02-21T09:20:25.6412062Z [137s] Generation 3 starting: 97 neighbors, 5 active search path(s) 2026-02-21T09:20:54.1576317Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.6 configs/s 2026-02-21T09:21:00.1958119Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 16.5 configs/s 2026-02-21T09:21:05.3419639Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 190.8 2026-02-21T09:21:05.3420270Z configs/s 2026-02-21T09:21:05.8574618Z [177s] Generation 3 complete: 2026-02-21T09:21:05.8574975Z ok=102 2026-02-21T09:21:05.8575227Z min=0.0388 2026-02-21T09:21:05.8575400Z mid=0.0545 2026-02-21T09:21:05.8575601Z max=1.0205 2026-02-21T09:21:05.8575789Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:21:05.8576081Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:21:05.8576371Z 'l2_groupings': [32], 2026-02-21T09:21:05.8576593Z 'load_eviction_policies': ['', ''], 2026-02-21T09:21:05.8576838Z 'loop_orders': [[1, 0]], 2026-02-21T09:21:05.8577074Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:21:05.8577289Z 'num_stages': 3, 2026-02-21T09:21:05.8577477Z 'num_warps': 2, 2026-02-21T09:21:05.8577663Z 'pid_type': 'flat', 2026-02-21T09:21:05.8577877Z 'range_flattens': [None, None], 2026-02-21T09:21:05.8578115Z 'range_multi_buffers': [None, None], 2026-02-21T09:21:05.8578360Z 'range_num_stages': [0, 2], 2026-02-21T09:21:05.8578580Z 'range_unroll_factors': [0, 1], 2026-02-21T09:21:05.8578844Z 'range_warp_specializes': [], 2026-02-21T09:21:05.8579066Z 'waves_per_eu': 1} 2026-02-21T09:21:05.9751681Z [177s] Fitting surrogate: 397 points, 397 targets 2026-02-21T09:21:07.1919024Z [179s] Generation 4 starting: 79 neighbors, 5 active search path(s) 2026-02-21T09:21:43.3819417Z [215s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 8], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=2, 
num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[1, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:21:46.1295606Z [218s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:21:46.1312852Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.4 configs/s 2026-02-21T09:21:48.5359476Z [220s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[512, 4, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:21:48.5361417Z Tensor-likes are not close! 2026-02-21T09:21:48.5361625Z 2026-02-21T09:21:48.5361768Z Mismatched elements: 8006 / 32768 (24.4%) 2026-02-21T09:21:48.5363015Z Greatest absolute difference: 342.0 at index (0, 6175) (up to 0.01 allowed) 2026-02-21T09:21:48.5363599Z Greatest relative difference: 4608.0 at index (0, 549) (up to 0.01 allowed) 2026-02-21T09:21:48.5364145Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:21:48.5364444Z 2026-02-21T09:21:50.9293569Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.6 configs/s 2026-02-21T09:21:55.3020185Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 224.2 2026-02-21T09:21:55.3020457Z configs/s 2026-02-21T09:21:55.7656657Z [227s] Generation 4 complete: 2026-02-21T09:21:55.7656866Z error=1 2026-02-21T09:21:55.7656956Z timeout=2 2026-02-21T09:21:55.7657036Z ok=81 2026-02-21T09:21:55.7657121Z min=0.0388 2026-02-21T09:21:55.7657201Z mid=0.0466 2026-02-21T09:21:55.7657282Z max=1.3751 2026-02-21T09:21:55.7657370Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:21:55.7657514Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:21:55.7657717Z 'l2_groupings': [32], 2026-02-21T09:21:55.7658128Z 'load_eviction_policies': ['', ''], 2026-02-21T09:21:55.7658267Z 'loop_orders': [[1, 0]], 2026-02-21T09:21:55.7658373Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:21:55.7658482Z 'num_stages': 3, 2026-02-21T09:21:55.7658572Z 'num_warps': 2, 2026-02-21T09:21:55.7658666Z 'pid_type': 'flat', 2026-02-21T09:21:55.7658765Z 'range_flattens': [None, None], 2026-02-21T09:21:55.7662395Z 'range_multi_buffers': [None, None], 2026-02-21T09:21:55.7662510Z 'range_num_stages': [0, 2], 2026-02-21T09:21:55.7662620Z 'range_unroll_factors': [0, 1], 2026-02-21T09:21:55.7662731Z 'range_warp_specializes': [], 2026-02-21T09:21:55.7662843Z 'waves_per_eu': 1} 2026-02-21T09:21:55.8644654Z [227s] Fitting surrogate: 481 points, 481 targets 2026-02-21T09:21:56.6291063Z [228s] Generation 5 starting: 71 neighbors, 4 active search path(s) 2026-02-21T09:22:31.0463921Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.3 configs/s 2026-02-21T09:22:35.5289564Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.4 configs/s 2026-02-21T09:22:39.5962487Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 240.7 2026-02-21T09:22:39.5963239Z configs/s 2026-02-21T09:22:40.0664947Z 
[272s] Generation 5 complete: 2026-02-21T09:22:40.0665287Z ok=76 2026-02-21T09:22:40.0665480Z min=0.0388 2026-02-21T09:22:40.0665678Z mid=0.0464 2026-02-21T09:22:40.0665861Z max=1.5840 2026-02-21T09:22:40.0666072Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:22:40.0666407Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:22:40.0666745Z 'l2_groupings': [32], 2026-02-21T09:22:40.0667008Z 'load_eviction_policies': ['', ''], 2026-02-21T09:22:40.0667288Z 'loop_orders': [[1, 0]], 2026-02-21T09:22:40.0667545Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:22:40.0667786Z 'num_stages': 3, 2026-02-21T09:22:40.0668002Z 'num_warps': 2, 2026-02-21T09:22:40.0668244Z 'pid_type': 'flat', 2026-02-21T09:22:40.0668598Z 'range_flattens': [None, None], 2026-02-21T09:22:40.0668886Z 'range_multi_buffers': [None, None], 2026-02-21T09:22:40.0669198Z 'range_num_stages': [0, 2], 2026-02-21T09:22:40.0669457Z 'range_unroll_factors': [0, 1], 2026-02-21T09:22:40.0669721Z 'range_warp_specializes': [], 2026-02-21T09:22:40.0669981Z 'waves_per_eu': 1} 2026-02-21T09:22:40.1820699Z [272s] Fitting surrogate: 557 points, 557 targets 2026-02-21T09:22:40.7469271Z [272s] Generation 6 starting: 52 neighbors, 3 active search path(s) 2026-02-21T09:22:48.8849724Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 3.0 configs/s 2026-02-21T09:22:52.3112198Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 15.8 configs/s 2026-02-21T09:22:56.1384759Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 305.6 2026-02-21T09:22:56.1385148Z configs/s 2026-02-21T09:22:56.5418584Z [288s] Generation 6 complete: 2026-02-21T09:22:56.5419033Z ok=56 2026-02-21T09:22:56.5419245Z min=0.0389 2026-02-21T09:22:56.5419484Z mid=0.0446 2026-02-21T09:22:56.5419689Z max=0.7937 2026-02-21T09:22:56.5419925Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:22:56.5420300Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:22:56.5420669Z 'l2_groupings': [32], 
2026-02-21T09:22:56.5420945Z 'load_eviction_policies': ['', ''], 2026-02-21T09:22:56.5421259Z 'loop_orders': [[1, 0]], 2026-02-21T09:22:56.5421538Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:22:56.5421822Z 'num_stages': 3, 2026-02-21T09:22:56.5422057Z 'num_warps': 2, 2026-02-21T09:22:56.5422244Z 'pid_type': 'flat', 2026-02-21T09:22:56.5422412Z 'range_flattens': [None, None], 2026-02-21T09:22:56.5422600Z 'range_multi_buffers': [None, None], 2026-02-21T09:22:56.5422783Z 'range_num_stages': [0, 2], 2026-02-21T09:22:56.5422955Z 'range_unroll_factors': [0, 1], 2026-02-21T09:22:56.5423143Z 'range_warp_specializes': [], 2026-02-21T09:22:56.5423322Z 'waves_per_eu': 1} 2026-02-21T09:22:56.6176628Z [288s] Fitting surrogate: 613 points, 613 targets 2026-02-21T09:22:57.2349523Z [289s] Generation 7 starting: 48 neighbors, 3 active search path(s) 2026-02-21T09:23:04.8150373Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 6.4 configs/s 2026-02-21T09:23:07.8628791Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 16.5 configs/s 2026-02-21T09:23:10.7200929Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 337.6 2026-02-21T09:23:10.7201547Z configs/s 2026-02-21T09:23:11.0884376Z [303s] Generation 7 complete: 2026-02-21T09:23:11.0884683Z ok=52 2026-02-21T09:23:11.0884856Z min=0.0387 2026-02-21T09:23:11.0885029Z mid=0.0460 2026-02-21T09:23:11.0885190Z max=0.2855 2026-02-21T09:23:11.0885371Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:23:11.0885673Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:23:11.0886019Z 'l2_groupings': [32], 2026-02-21T09:23:11.0886240Z 'load_eviction_policies': ['', ''], 2026-02-21T09:23:11.0886520Z 'loop_orders': [[1, 0]], 2026-02-21T09:23:11.0886742Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:23:11.0886995Z 'num_stages': 3, 2026-02-21T09:23:11.0887182Z 'num_warps': 2, 2026-02-21T09:23:11.0887377Z 'pid_type': 'flat', 2026-02-21T09:23:11.0887591Z 'range_flattens': [None, None], 
2026-02-21T09:23:11.0887835Z 'range_multi_buffers': [None, None], 2026-02-21T09:23:11.0888085Z 'range_num_stages': [0, 2], 2026-02-21T09:23:11.0888308Z 'range_unroll_factors': [0, 1], 2026-02-21T09:23:11.0888563Z 'range_warp_specializes': [], 2026-02-21T09:23:11.0888803Z 'waves_per_eu': 1} 2026-02-21T09:23:11.1468505Z [303s] Fitting surrogate: 665 points, 665 targets 2026-02-21T09:23:11.5804276Z [303s] Generation 8 starting: 37 neighbors, 2 active search path(s) 2026-02-21T09:23:17.9289164Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 3.6 configs/s 2026-02-21T09:23:20.3179935Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 16.5 configs/s 2026-02-21T09:23:22.4407838Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 448.1 2026-02-21T09:23:22.4408465Z configs/s 2026-02-21T09:23:22.7455284Z [314s] Generation 8 complete: 2026-02-21T09:23:22.7455649Z ok=40 2026-02-21T09:23:22.7455952Z min=0.0384 2026-02-21T09:23:22.7456212Z mid=0.0445 2026-02-21T09:23:22.7456419Z max=0.3020 2026-02-21T09:23:22.7456651Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:23:22.7457020Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:23:22.7457403Z 'l2_groupings': [1], 2026-02-21T09:23:22.7457679Z 'load_eviction_policies': ['', ''], 2026-02-21T09:23:22.7458002Z 'loop_orders': [[0, 1]], 2026-02-21T09:23:22.7458284Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:23:22.7458552Z 'num_stages': 1, 2026-02-21T09:23:22.7458785Z 'num_warps': 4, 2026-02-21T09:23:22.7459057Z 'pid_type': 'flat', 2026-02-21T09:23:22.7459318Z 'range_flattens': [None, None], 2026-02-21T09:23:22.7459575Z 'range_multi_buffers': [None, False], 2026-02-21T09:23:22.7459859Z 'range_num_stages': [0, 2], 2026-02-21T09:23:22.7460097Z 'range_unroll_factors': [0, 0], 2026-02-21T09:23:22.7460350Z 'range_warp_specializes': [], 2026-02-21T09:23:22.7460582Z 'waves_per_eu': 3} 2026-02-21T09:23:22.7830875Z [314s] Fitting surrogate: 705 points, 705 targets 
2026-02-21T09:23:23.2116073Z [315s] Generation 9 starting: 36 neighbors, 2 active search path(s) 2026-02-21T09:23:28.7091777Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 4.5 configs/s 2026-02-21T09:23:31.0566854Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 16.3 configs/s 2026-02-21T09:23:32.7000023Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 564.2 2026-02-21T09:23:32.7000391Z configs/s 2026-02-21T09:23:32.9872466Z [324s] Generation 9 complete: 2026-02-21T09:23:32.9872865Z ok=38 2026-02-21T09:23:32.9873058Z min=0.0385 2026-02-21T09:23:32.9873867Z mid=0.0523 2026-02-21T09:23:32.9874001Z max=0.3922 2026-02-21T09:23:32.9874139Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:23:32.9874359Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:23:32.9874575Z 'l2_groupings': [1], 2026-02-21T09:23:32.9874730Z 'load_eviction_policies': ['', ''], 2026-02-21T09:23:32.9874912Z 'loop_orders': [[0, 1]], 2026-02-21T09:23:32.9875080Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:23:32.9875243Z 'num_stages': 1, 2026-02-21T09:23:32.9875384Z 'num_warps': 4, 2026-02-21T09:23:32.9875521Z 'pid_type': 'flat', 2026-02-21T09:23:32.9875668Z 'range_flattens': [None, None], 2026-02-21T09:23:32.9875850Z 'range_multi_buffers': [None, False], 2026-02-21T09:23:32.9876028Z 'range_num_stages': [0, 2], 2026-02-21T09:23:32.9876195Z 'range_unroll_factors': [0, 0], 2026-02-21T09:23:32.9876372Z 'range_warp_specializes': [], 2026-02-21T09:23:32.9876545Z 'waves_per_eu': 3} 2026-02-21T09:23:33.0146879Z [324s] Fitting surrogate: 743 points, 743 targets 2026-02-21T09:23:33.3910158Z [325s] Generation 10 starting: 31 neighbors, 2 active search path(s) 2026-02-21T09:23:37.8973572Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 3.1 configs/s 2026-02-21T09:23:39.9229548Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 16.5 configs/s 2026-02-21T09:23:42.3415334Z Generation 10: verifying top configs 100% 
━━━━━━━━━━━━━ 1000/1000 496.4 2026-02-21T09:23:42.3415976Z configs/s 2026-02-21T09:23:42.6715929Z [334s] Generation 10 complete: 2026-02-21T09:23:42.6716448Z ok=33 2026-02-21T09:23:42.6716812Z min=0.0383 2026-02-21T09:23:42.6717057Z mid=0.0433 2026-02-21T09:23:42.6717261Z max=0.2369 2026-02-21T09:23:42.6717498Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:23:42.6717978Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:23:42.6719001Z 'l2_groupings': [1], 2026-02-21T09:23:42.6719356Z 'load_eviction_policies': ['', ''], 2026-02-21T09:23:42.6719704Z 'loop_orders': [[0, 1]], 2026-02-21T09:23:42.6719981Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:23:42.6720258Z 'num_stages': 1, 2026-02-21T09:23:42.6720482Z 'num_warps': 4, 2026-02-21T09:23:42.6720718Z 'pid_type': 'flat', 2026-02-21T09:23:42.6720988Z 'range_flattens': [None, None], 2026-02-21T09:23:42.6721294Z 'range_multi_buffers': [None, False], 2026-02-21T09:23:42.6721611Z 'range_num_stages': [0, 2], 2026-02-21T09:23:42.6721897Z 'range_unroll_factors': [0, 0], 2026-02-21T09:23:42.6722199Z 'range_warp_specializes': [], 2026-02-21T09:23:42.6722474Z 'waves_per_eu': 3} 2026-02-21T09:23:42.7080351Z [334s] Fitting surrogate: 776 points, 776 targets 2026-02-21T09:23:42.9664962Z [334s] Generation 11 starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:23:59.3342554Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.4 configs/s 2026-02-21T09:24:00.4723009Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.1 configs/s 2026-02-21T09:24:01.2589896Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1073.0 2026-02-21T09:24:01.2590163Z configs/s 2026-02-21T09:24:01.4972339Z [353s] Generation 11 complete: 2026-02-21T09:24:01.4972514Z ok=19 2026-02-21T09:24:01.4972622Z min=0.0388 2026-02-21T09:24:01.4972710Z mid=0.0528 2026-02-21T09:24:01.4972786Z max=0.4943 2026-02-21T09:24:01.4972881Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:24:01.4973025Z 'indexing': 
['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:24:01.4973159Z 'l2_groupings': [1], 2026-02-21T09:24:01.4973266Z 'load_eviction_policies': ['', ''], 2026-02-21T09:24:01.4973381Z 'loop_orders': [[0, 1]], 2026-02-21T09:24:01.4973493Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:24:01.4973599Z 'num_stages': 1, 2026-02-21T09:24:01.4973691Z 'num_warps': 4, 2026-02-21T09:24:01.4973830Z 'pid_type': 'flat', 2026-02-21T09:24:01.4973934Z 'range_flattens': [None, None], 2026-02-21T09:24:01.4974054Z 'range_multi_buffers': [None, False], 2026-02-21T09:24:01.4974513Z 'range_num_stages': [0, 2], 2026-02-21T09:24:01.4974626Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:01.4974738Z 'range_warp_specializes': [], 2026-02-21T09:24:01.4974849Z 'waves_per_eu': 3} 2026-02-21T09:24:01.5087115Z [353s] Fitting surrogate: 795 points, 795 targets 2026-02-21T09:24:01.7514106Z [353s] Generation 12 starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:24:04.2018753Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 15.0 configs/s 2026-02-21T09:24:05.3051971Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.7 configs/s 2026-02-21T09:24:06.0772578Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1122.0 2026-02-21T09:24:06.0772865Z configs/s 2026-02-21T09:24:06.3117267Z [358s] Generation 12 complete: 2026-02-21T09:24:06.3117487Z ok=19 2026-02-21T09:24:06.3117582Z min=0.0384 2026-02-21T09:24:06.3117683Z mid=0.0442 2026-02-21T09:24:06.3117780Z max=0.1085 2026-02-21T09:24:06.3117874Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:24:06.3118016Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:24:06.3118163Z 'l2_groupings': [1], 2026-02-21T09:24:06.3118272Z 'load_eviction_policies': ['', ''], 2026-02-21T09:24:06.3118396Z 'loop_orders': [[0, 1]], 2026-02-21T09:24:06.3118504Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:24:06.3118610Z 'num_stages': 1, 2026-02-21T09:24:06.3118700Z 'num_warps': 4, 2026-02-21T09:24:06.3118796Z 
'pid_type': 'flat', 2026-02-21T09:24:06.3118897Z 'range_flattens': [None, None], 2026-02-21T09:24:06.3119019Z 'range_multi_buffers': [None, False], 2026-02-21T09:24:06.3119143Z 'range_num_stages': [0, 2], 2026-02-21T09:24:06.3119251Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:06.3119367Z 'range_warp_specializes': [], 2026-02-21T09:24:06.3120218Z 'waves_per_eu': 3} 2026-02-21T09:24:06.3299411Z [358s] Fitting surrogate: 814 points, 814 targets 2026-02-21T09:24:06.5960270Z [358s] Generation 13 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:24:10.5732590Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 3.7 configs/s 2026-02-21T09:24:11.9268263Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 15.6 configs/s 2026-02-21T09:24:13.2329529Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 724.0 2026-02-21T09:24:13.2330057Z configs/s 2026-02-21T09:24:13.5303897Z [365s] Generation 13 complete: 2026-02-21T09:24:13.5304041Z ok=21 2026-02-21T09:24:13.5305196Z min=0.0385 2026-02-21T09:24:13.5305370Z mid=0.0431 2026-02-21T09:24:13.5305455Z max=0.0685 2026-02-21T09:24:13.5305556Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:24:13.5305739Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:24:13.5305929Z 'l2_groupings': [1], 2026-02-21T09:24:13.5306035Z 'load_eviction_policies': ['', ''], 2026-02-21T09:24:13.5306203Z 'loop_orders': [[0, 1]], 2026-02-21T09:24:13.5306308Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:24:13.5306414Z 'num_stages': 1, 2026-02-21T09:24:13.5306505Z 'num_warps': 4, 2026-02-21T09:24:13.5306592Z 'pid_type': 'flat', 2026-02-21T09:24:13.5306693Z 'range_flattens': [None, None], 2026-02-21T09:24:13.5306808Z 'range_multi_buffers': [None, False], 2026-02-21T09:24:13.5306925Z 'range_num_stages': [0, 2], 2026-02-21T09:24:13.5307031Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:13.5307142Z 'range_warp_specializes': [], 2026-02-21T09:24:13.5307245Z 'waves_per_eu': 3} 
2026-02-21T09:24:13.5589558Z [365s] Fitting surrogate: 835 points, 835 targets 2026-02-21T09:24:13.7924104Z [365s] Generation 14 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:24:16.2966609Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 9.0 configs/s 2026-02-21T09:24:17.5376808Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 16.4 configs/s 2026-02-21T09:24:18.9314304Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 685.8 2026-02-21T09:24:18.9315509Z configs/s 2026-02-21T09:24:19.2361778Z [371s] Generation 14 complete: 2026-02-21T09:24:19.2363645Z ok=20 2026-02-21T09:24:19.2363809Z min=0.0386 2026-02-21T09:24:19.2363964Z mid=0.0395 2026-02-21T09:24:19.2364204Z max=0.0684 2026-02-21T09:24:19.2364378Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:24:19.2364657Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:24:19.2364933Z 'l2_groupings': [4], 2026-02-21T09:24:19.2365139Z 'load_eviction_policies': ['', ''], 2026-02-21T09:24:19.2365372Z 'loop_orders': [[1, 0]], 2026-02-21T09:24:19.2365583Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:24:19.2365790Z 'num_stages': 4, 2026-02-21T09:24:19.2365958Z 'num_warps': 4, 2026-02-21T09:24:19.2366177Z 'pid_type': 'flat', 2026-02-21T09:24:19.2366371Z 'range_flattens': [None, False], 2026-02-21T09:24:19.2366607Z 'range_multi_buffers': [None, None], 2026-02-21T09:24:19.2366868Z 'range_num_stages': [0, 1], 2026-02-21T09:24:19.2367081Z 'range_unroll_factors': [0, 2], 2026-02-21T09:24:19.2367303Z 'range_warp_specializes': [], 2026-02-21T09:24:19.2367515Z 'waves_per_eu': 1} 2026-02-21T09:24:19.2687924Z [371s] Fitting surrogate: 855 points, 855 targets 2026-02-21T09:24:19.9688387Z [371s] Generation 15 starting: 12 neighbors, 1 active search path(s) 2026-02-21T09:24:22.3247371Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 4.0 configs/s 2026-02-21T09:24:23.1972909Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 16.4 
configs/s 2026-02-21T09:24:23.7810357Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1483.9 2026-02-21T09:24:23.7810880Z configs/s 2026-02-21T09:24:24.0306722Z [375s] Generation 15 complete: 2026-02-21T09:24:24.0307181Z ok=13 2026-02-21T09:24:24.0307397Z min=0.0389 2026-02-21T09:24:24.0307644Z mid=0.0416 2026-02-21T09:24:24.0307850Z max=0.1087 2026-02-21T09:24:24.0308072Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:24:24.0308444Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:24:24.0308806Z 'l2_groupings': [8], 2026-02-21T09:24:24.0309085Z 'load_eviction_policies': ['', ''], 2026-02-21T09:24:24.0309396Z 'loop_orders': [[1, 0]], 2026-02-21T09:24:24.0309677Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:24:24.0309946Z 'num_stages': 4, 2026-02-21T09:24:24.0310176Z 'num_warps': 4, 2026-02-21T09:24:24.0310410Z 'pid_type': 'flat', 2026-02-21T09:24:24.0310678Z 'range_flattens': [None, False], 2026-02-21T09:24:24.0310988Z 'range_multi_buffers': [None, None], 2026-02-21T09:24:24.0311298Z 'range_num_stages': [0, 1], 2026-02-21T09:24:24.0311586Z 'range_unroll_factors': [0, 2], 2026-02-21T09:24:24.0311883Z 'range_warp_specializes': [], 2026-02-21T09:24:24.0312172Z 'waves_per_eu': 1} 2026-02-21T09:24:24.0419949Z [376s] Fitting surrogate: 868 points, 868 targets 2026-02-21T09:24:24.3746806Z [376s] Generation 16 starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:24:27.1652840Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 21.9 configs/s 2026-02-21T09:24:28.3887807Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 15.7 configs/s 2026-02-21T09:24:29.7249630Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 695.9 2026-02-21T09:24:29.7249991Z configs/s 2026-02-21T09:24:30.0443518Z [382s] Generation 16 complete: 2026-02-21T09:24:30.0443720Z ok=18 2026-02-21T09:24:30.0443812Z min=0.0389 2026-02-21T09:24:30.0443894Z mid=0.0413 2026-02-21T09:24:30.0443979Z max=0.0633 
2026-02-21T09:24:30.0444076Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:24:30.0444225Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:24:30.0444431Z 'l2_groupings': [8], 2026-02-21T09:24:30.0444564Z 'load_eviction_policies': ['', ''], 2026-02-21T09:24:30.0444705Z 'loop_orders': [[1, 0]], 2026-02-21T09:24:30.0444812Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:24:30.0444923Z 'num_stages': 4, 2026-02-21T09:24:30.0445010Z 'num_warps': 4, 2026-02-21T09:24:30.0445104Z 'pid_type': 'flat', 2026-02-21T09:24:30.0445206Z 'range_flattens': [None, False], 2026-02-21T09:24:30.0445325Z 'range_multi_buffers': [None, True], 2026-02-21T09:24:30.0445447Z 'range_num_stages': [0, 1], 2026-02-21T09:24:30.0445553Z 'range_unroll_factors': [0, 2], 2026-02-21T09:24:30.0445669Z 'range_warp_specializes': [], 2026-02-21T09:24:30.0445777Z 'waves_per_eu': 1} 2026-02-21T09:24:30.0758198Z [382s] Fitting surrogate: 886 points, 886 targets 2026-02-21T09:24:30.3659035Z [382s] Generation 17 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:24:33.0543170Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 13.4 configs/s 2026-02-21T09:24:34.1866883Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.1 configs/s 2026-02-21T09:24:35.3241749Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 803.0 2026-02-21T09:24:35.3242366Z configs/s 2026-02-21T09:24:35.6022112Z [387s] Generation 17 complete: 2026-02-21T09:24:35.6022482Z ok=17 2026-02-21T09:24:35.6022702Z min=0.0390 2026-02-21T09:24:35.6022923Z mid=0.0411 2026-02-21T09:24:35.6023123Z max=0.0684 2026-02-21T09:24:35.6023355Z best={'block_sizes': [256, 4, 8], 2026-02-21T09:24:35.6023732Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:24:35.6024097Z 'l2_groupings': [8], 2026-02-21T09:24:35.6024384Z 'load_eviction_policies': ['', ''], 2026-02-21T09:24:35.6024693Z 'loop_orders': [[1, 0]], 2026-02-21T09:24:35.6024979Z 'matrix_instr_nonkdim': 0, 
2026-02-21T09:24:35.6025255Z 'num_stages': 4, 2026-02-21T09:24:35.6025492Z 'num_warps': 4, 2026-02-21T09:24:35.6025773Z 'pid_type': 'flat', 2026-02-21T09:24:35.6026050Z 'range_flattens': [None, False], 2026-02-21T09:24:35.6026359Z 'range_multi_buffers': [None, True], 2026-02-21T09:24:35.6026687Z 'range_num_stages': [0, 1], 2026-02-21T09:24:35.6026976Z 'range_unroll_factors': [0, 2], 2026-02-21T09:24:35.6027280Z 'range_warp_specializes': [], 2026-02-21T09:24:35.6027564Z 'waves_per_eu': 1} 2026-02-21T09:24:35.6280588Z [387s] Fitting surrogate: 903 points, 903 targets 2026-02-21T09:24:35.7598803Z [387s] Autotuning complete in 387.7s after searching 845 configs. 2026-02-21T09:24:35.7599355Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:24:35.7601253Z @helion.kernel(config=helion.Config(block_sizes=[256, 4, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:24:35.7603116Z 2026-02-21T09:24:35.7603977Z [387s] Code of selected kernel: /tmp/torchinductor_root/nu/cnukfdkpviyw2zwyyyx532dwweldqjajvionaw2sprqyzlaiy3s3.py 2026-02-21T09:24:35.7774652Z from __future__ import annotations 2026-02-21T09:24:35.7774985Z 2026-02-21T09:24:35.7775087Z import torch 2026-02-21T09:24:35.7775319Z import triton 2026-02-21T09:24:35.7775576Z import triton.language as tl 2026-02-21T09:24:35.7775983Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:24:35.7776312Z 2026-02-21T09:24:35.7776435Z _BLOCK_SIZE_2 = tl.constexpr(8) 2026-02-21T09:24:35.7776730Z _BLOCK_SIZE_1 = tl.constexpr(4) 2026-02-21T09:24:35.7777023Z _BLOCK_SIZE_0 = tl.constexpr(256) 2026-02-21T09:24:35.7777226Z 2026-02-21T09:24:35.7777312Z 
@triton.jit 2026-02-21T09:24:35.7777723Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:24:35.7778336Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:24:35.7778757Z num_pid_m = tl.cdiv(8192, _BLOCK_SIZE_2) 2026-02-21T09:24:35.7779103Z num_pid_n = tl.cdiv(4, _BLOCK_SIZE_1) 2026-02-21T09:24:35.7779430Z inner_2d_pid = tl.program_id(0) 2026-02-21T09:24:35.7779748Z num_pid_in_group = 8 * num_pid_n 2026-02-21T09:24:35.7780086Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:24:35.7780440Z first_pid_m = group_id * 8 2026-02-21T09:24:35.7780777Z group_size_m = min(num_pid_m - first_pid_m, 8) 2026-02-21T09:24:35.7781215Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:24:35.7781482Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T09:24:35.7781674Z offset_2 = pid_0 * _BLOCK_SIZE_2 2026-02-21T09:24:35.7781878Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:24:35.7782084Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T09:24:35.7782470Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:24:35.7782739Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:24:35.7783010Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:24:35.7783304Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:24:35.7783664Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:24:35.7784018Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:24:35.7784264Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:24:35.7784590Z for offset_3 in tl.range(0, 1792, _BLOCK_SIZE_0, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=False, flatten=False): 2026-02-21T09:24:35.7784982Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:24:35.7785183Z acc_copy = acc 2026-02-21T09:24:35.7785317Z acc_copy_0 = acc_copy 2026-02-21T09:24:35.7785510Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:24:35.7785713Z mul = 2 * offset_3 2026-02-21T09:24:35.7785929Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:24:35.7786186Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:24:35.7786401Z load = tl.load(A + (indices_1[:, None] * 3584 + iota[None, :] * 1), None) 2026-02-21T09:24:35.7786693Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:24:35.7786953Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:24:35.7787155Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:24:35.7787354Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:24:35.7787597Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:24:35.7787913Z b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), None) 2026-02-21T09:24:35.7788273Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:24:35.7788518Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:24:35.7788669Z v_2 = b_tile << v_1 2026-02-21T09:24:35.7788802Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:24:35.7788952Z v_4 = v_2 >> v_3 2026-02-21T09:24:35.7789165Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:24:35.7789405Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:24:35.7789551Z v_6 = b_tile >> v_5 2026-02-21T09:24:35.7789736Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 
2026-02-21T09:24:35.7789957Z stack_idx = tl.arange(0, 2) 2026-02-21T09:24:35.7790117Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:24:35.7790297Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:24:35.7790469Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:24:35.7790650Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:24:35.7790831Z mask_0 = broadcast_idx == 0 2026-02-21T09:24:35.7791032Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:24:35.7791259Z mask_1 = broadcast_idx == 1 2026-02-21T09:24:35.7791443Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:24:35.7791630Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:24:35.7791828Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:24:35.7792013Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:24:35.7792184Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:24:35.7792347Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:24:35.7792570Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:24:35.7792755Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:24:35.7792910Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:24:35.7793070Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:24:35.7793268Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:24:35.7793471Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:24:35.7793605Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:24:35.7793743Z acc = acc_copy_0 + sum_1 2026-02-21T09:24:35.7793893Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:24:35.7794055Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:24:35.7794217Z tl.store(C + (indices_1[:, None] * 8192 + indices_2[None, :] * 1), v_10, None) 2026-02-21T09:24:35.7794354Z 2026-02-21T09:24:35.7794445Z 
def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:24:35.7794612Z """ 2026-02-21T09:24:35.7794729Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:24:35.7794839Z 2026-02-21T09:24:35.7794905Z This kernel performs matrix multiplication where: 2026-02-21T09:24:35.7795059Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:24:35.7795255Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:24:35.7795434Z (two 4-bit values packed into each int8) 2026-02-21T09:24:35.7795523Z 2026-02-21T09:24:35.7795560Z Args: 2026-02-21T09:24:35.7795688Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:24:35.7795877Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:24:35.7796002Z 2026-02-21T09:24:35.7796041Z Returns: 2026-02-21T09:24:35.7796165Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 2026-02-21T09:24:35.7796314Z """ 2026-02-21T09:24:35.7796411Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:24:35.7796532Z M, K = A.shape 2026-02-21T09:24:35.7796671Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:24:35.7796785Z _, N = B.shape 2026-02-21T09:24:35.7796942Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:24:35.7797156Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:24:35.7797345Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:24:35.7797499Z _BLOCK_SIZE_2 = 8 2026-02-21T09:24:35.7797600Z _BLOCK_SIZE_1 = 4 2026-02-21T09:24:35.7797772Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:24:35.7798059Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:24:35.7798332Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:24:35.7798526Z # 
src[int4_gemm.py:60-89]: ... 2026-02-21T09:24:35.7798641Z _BLOCK_SIZE_0 = 256 2026-02-21T09:24:35.7798775Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:24:35.7798968Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:24:35.7799142Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:24:35.7799276Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:24:35.7799427Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:24:35.7799621Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:24:35.7799801Z # src[int4_gemm.py:57-91]: ... 2026-02-21T09:24:35.7799944Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:24:35.7800375Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(8192, _BLOCK_SIZE_2) * triton.cdiv(4, _BLOCK_SIZE_1),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=4, num_stages=4, waves_per_eu=1, matrix_instr_nonkdim=0) 2026-02-21T09:24:35.7800742Z # src[int4_gemm.py:93]: return C 2026-02-21T09:24:35.7800865Z return C 2026-02-21T09:24:36.6153979Z WARNING:tritonbench.utils.triton_op:Completed input ID 7: 2026-02-21T09:24:36.6154429Z x_val 2026-02-21T09:24:36.6154639Z ------------------ 2026-02-21T09:24:36.6154869Z (4, 1, 8192, 3584) 2026-02-21T09:24:36.6155002Z 2026-02-21T09:24:36.6186032Z 30%|███ | 3/10 [21:57<51:01, 437.34s/it] WARNING:tritonbench.utils.triton_op:Running input ID 10: 2026-02-21T09:24:36.6186470Z x_val 2026-02-21T09:24:36.6186650Z ------------------- 2026-02-21T09:24:36.6186847Z (16, 1, 7168, 8192) 2026-02-21T09:24:36.6189760Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:24:37.6151837Z INFO:tritonbench.utils.triton_op:Took 5.01ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:24:39.0638347Z INFO:tritonbench.utils.triton_op:Took 0.22ms to get benchmark function for 
preprocessed_triton_int4_gemm 2026-02-21T09:24:39.0702758Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:24:39.0703124Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:24:39.0703424Z 'dtype': 'torch.bfloat16', 2026-02-21T09:24:39.0703712Z 'shape': (16, 1, 8192), 2026-02-21T09:24:39.0703995Z 'stride': (8192, 8192, 1)}, 2026-02-21T09:24:39.0704269Z { 'device': 'cuda:0', 2026-02-21T09:24:39.0704531Z 'dtype': 'torch.int32', 2026-02-21T09:24:39.0704800Z 'shape': (8192, 7168), 2026-02-21T09:24:39.0705054Z 'stride': (7168, 1)}), 2026-02-21T09:24:39.0705301Z 'kwargs': {}} 2026-02-21T09:24:39.0743181Z INFO:tritonbench.utils.triton_op:Took 4.25ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T09:24:39.2682857Z [0s] Autotune random seed: 2138032649 2026-02-21T09:24:39.2942162Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:25:22.1995909Z [42s] Timeout after 30s compiling Config(block_sizes=[8, 2, 4096], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:25:22.6261084Z [43s] Timeout after 30s compiling Config(block_sizes=[128, 1, 512], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:25:22.6284519Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 
configs/s 2026-02-21T09:25:29.9183542Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.8 configs/s 2026-02-21T09:25:29.9193949Z [50s] Adaptive compile timeout: 30s (90% percentile=12.0s, bounds=[30.0s, 30s]) 2026-02-21T09:25:29.9195523Z [50s] Initial random population of 100, 5 starting points: 2026-02-21T09:25:29.9195663Z error=2 2026-02-21T09:25:29.9195754Z timeout=2 2026-02-21T09:25:29.9195829Z ok=96 2026-02-21T09:25:29.9195920Z min=0.1809 2026-02-21T09:25:29.9195995Z mid=1.2689 2026-02-21T09:25:29.9196072Z max=36.9792 2026-02-21T09:25:29.9196159Z best={'block_sizes': [16, 16, 16], 2026-02-21T09:25:29.9196287Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:25:29.9196412Z 'l2_groupings': [1], 2026-02-21T09:25:29.9196515Z 'load_eviction_policies': ['', ''], 2026-02-21T09:25:29.9196627Z 'loop_orders': [[0, 1]], 2026-02-21T09:25:29.9196728Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:25:29.9197238Z 'num_stages': 1, 2026-02-21T09:25:29.9197323Z 'num_warps': 4, 2026-02-21T09:25:29.9197411Z 'pid_type': 'flat', 2026-02-21T09:25:29.9197515Z 'range_flattens': [None, None], 2026-02-21T09:25:29.9197628Z 'range_multi_buffers': [None, None], 2026-02-21T09:25:29.9197742Z 'range_num_stages': [0, 0], 2026-02-21T09:25:29.9197846Z 'range_unroll_factors': [0, 0], 2026-02-21T09:25:29.9197953Z 'range_warp_specializes': [], 2026-02-21T09:25:29.9198059Z 'waves_per_eu': 1} 2026-02-21T09:25:29.9212052Z [50s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:25:31.6211877Z [52s] Generation 1 starting: 97 neighbors, 5 active search path(s) 2026-02-21T09:26:13.3841914Z [94s] Timeout after 30s compiling Config(block_sizes=[256, 2, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[4, 4], 
range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:26:13.3862589Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T09:26:14.2341451Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:26:14.2345968Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:26:14.2346874Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:26:14.2347671Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:26:14.2348467Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:26:14.2349121Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:26:14.2350397Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:26:14.2350804Z #smem = #ttg.shared_memory 2026-02-21T09:26:14.2351265Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:26:14.2352184Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:26:14.2352950Z %cst = arith.constant dense<0.000000e+00> : tensor<16x16xf32, #mma> 2026-02-21T09:26:14.2353269Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:26:14.2353495Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:26:14.2353722Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:26:14.2354011Z 
%cst_0 = arith.constant dense<0> : tensor<4x2x16xi8, #blocked> 2026-02-21T09:26:14.2354301Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:26:14.2354531Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:26:14.2354902Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:26:14.2355113Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:26:14.2355383Z %cst_1 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:26:14.2355810Z %cst_2 = arith.constant dense<7168> : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2356314Z %cst_3 = arith.constant dense<4> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2356726Z %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:26:14.2357055Z %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:26:14.2357377Z %cst_6 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:26:14.2357865Z %0 = tt.get_program_id x : i32 2026-02-21T09:26:14.2358083Z %1 = arith.muli %0, %c2_i32 : i32 2026-02-21T09:26:14.2358306Z %2 = arith.addi %1, %c2_i32 : i32 2026-02-21T09:26:14.2358526Z %3 = arith.minsi %2, %c448_i32 : i32 2026-02-21T09:26:14.2358984Z %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:26:14.2359583Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:26:14.2360165Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:26:14.2360756Z %7 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:26:14.2361133Z %8 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:26:14.2361575Z %9 = tt.expand_dims %8 
{axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:26:14.2361931Z %10 = arith.muli %9, %cst_1 : tensor<16x1xi32, #blocked1> 2026-02-21T09:26:14.2362197Z %11 = tt.broadcast %10 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:26:14.2362503Z %12 = tt.splat %arg0 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:26:14.2362959Z %13 = tt.splat %arg1 : !tt.ptr -> tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2363381Z %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:26:14.2364077Z %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:26:14.2364638Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:26:14.2365056Z %17 = arith.cmpi eq, %16, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:26:14.2365328Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked> 2026-02-21T09:26:14.2365605Z %19 = arith.cmpi eq, %16, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:26:14.2365873Z %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked> 2026-02-21T09:26:14.2366202Z %21 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:26:14.2366613Z %22 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:26:14.2366949Z %23 = arith.muli %22, %cst_6 : tensor<16x1xi32, #mma> 2026-02-21T09:26:14.2367196Z %24 = tt.broadcast %23 : tensor<16x1xi32, #mma> -> tensor<16x16xi32, #mma> 2026-02-21T09:26:14.2367508Z 
%25 = tt.splat %arg2 : !tt.ptr -> tensor<16x16x!tt.ptr, #mma> 2026-02-21T09:26:14.2368093Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:26:14.2368287Z %26 = arith.muli %arg3, %c16_i32 : i32 2026-02-21T09:26:14.2368584Z %27 = tt.splat %26 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:26:14.2368922Z %28 = tt.splat %26 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:26:14.2369271Z %29 = arith.addi %27, %4 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:26:14.2369627Z %30 = arith.addi %28, %5 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:26:14.2370173Z %31 = tt.expand_dims %29 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2370745Z %32 = tt.broadcast %31 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2371124Z %33 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst) -> (tensor<16x16xf32, #mma>) : i32 { 2026-02-21T09:26:14.2371453Z %39 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:26:14.2371782Z %40 = arith.addi %39, %6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:26:14.2372016Z %41 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:26:14.2372206Z %42 = tt.splat %41 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:26:14.2372526Z %43 = arith.addi %42, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:26:14.2372830Z %44 = tt.expand_dims %43 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, 
#blocked1> 2026-02-21T09:26:14.2373141Z %45 = tt.broadcast %44 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:26:14.2373354Z %46 = arith.addi %11, %45 : tensor<16x8xi32, #blocked1> 2026-02-21T09:26:14.2373577Z %47 = tt.addptr %12, %46 : tensor<16x8x!tt.ptr, #blocked1>, tensor<16x8xi32, #blocked1> 2026-02-21T09:26:14.2373797Z %48 = tt.load %47 : tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:26:14.2374042Z %49 = ttg.local_alloc %48 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:26:14.2374405Z %50 = ttg.local_load %49 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:26:14.2374858Z %51 = arith.extf %50 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:26:14.2375392Z %52 = tt.expand_dims %40 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2375786Z %53 = arith.muli %52, %cst_2 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2376123Z %54 = tt.broadcast %53 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2376452Z %55 = arith.addi %54, %32 : tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2376784Z %56 = tt.addptr %13, %55 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2377124Z %57 = tt.load %56 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2377378Z %58 = arith.shli %57, %cst_3 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2377634Z %59 = arith.shrsi %58, 
%cst_3 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2377893Z %60 = arith.shrsi %57, %cst_3 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:26:14.2378210Z %61 = tt.expand_dims %59 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:26:14.2378574Z %62 = tt.expand_dims %60 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:26:14.2378879Z %63 = tt.broadcast %61 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:26:14.2379133Z %64 = arith.select %18, %63, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:26:14.2379426Z %65 = tt.broadcast %62 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:26:14.2379679Z %66 = arith.select %20, %65, %64 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:26:14.2379922Z %67 = tt.reshape %66 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2> 2026-02-21T09:26:14.2380165Z %68 = arith.sitofp %67 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2> 2026-02-21T09:26:14.2380433Z %69 = ttg.local_alloc %68 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem> 2026-02-21T09:26:14.2380766Z %70 = ttg.local_load %69 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:26:14.2381239Z %71 = tt.dot %51, %70, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x16xf32, #mma> 2026-02-21T09:26:14.2381588Z scf.yield %71 : tensor<16x16xf32, #mma> 2026-02-21T09:26:14.2381714Z } 2026-02-21T09:26:14.2381845Z %34 = arith.truncf %33 : tensor<16x16xf32, #mma> to tensor<16x16xbf16, #mma> 2026-02-21T09:26:14.2382109Z %35 = 
tt.expand_dims %30 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma> 2026-02-21T09:26:14.2382369Z %36 = tt.broadcast %35 : tensor<1x16xi32, #mma> -> tensor<16x16xi32, #mma> 2026-02-21T09:26:14.2382547Z %37 = arith.addi %24, %36 : tensor<16x16xi32, #mma> 2026-02-21T09:26:14.2382736Z %38 = tt.addptr %25, %37 : tensor<16x16x!tt.ptr, #mma>, tensor<16x16xi32, #mma> 2026-02-21T09:26:14.2382929Z tt.store %38, %34 : tensor<16x16x!tt.ptr, #mma> 2026-02-21T09:26:14.2383062Z } 2026-02-21T09:26:14.2383144Z tt.return 2026-02-21T09:26:14.2383232Z } 2026-02-21T09:26:14.2383313Z } 2026-02-21T09:26:14.2383362Z 2026-02-21T09:26:14.2383401Z {-# 2026-02-21T09:26:14.2383489Z external_resources: { 2026-02-21T09:26:14.2383591Z mlir_reproducer: { 2026-02-21T09:26:14.2384648Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:26:14.2385833Z disable_threading: false, 2026-02-21T09:26:14.2385943Z verify_each: true 2026-02-21T09:26:14.2386040Z } 2026-02-21T09:26:14.2386119Z } 2026-02-21T09:26:14.2386193Z #-} 2026-02-21T09:26:14.2386486Z /tmp/torchinductor_root/sr/csrcdavv3qtyuok5hacwwd33omqwh7txpyktxxydqp7domrkedaf.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:26:14.2387178Z /tmp/torchinductor_root/sr/csrcdavv3qtyuok5hacwwd33omqwh7txpyktxxydqp7domrkedaf.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` 
on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:26:14.2387733Z [94s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:26:14.2388543Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:26:14.2389243Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:26:14.2389424Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:26:19.4430413Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 16.6 configs/s 2026-02-21T09:26:20.6903287Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 609.0 2026-02-21T09:26:20.6903887Z configs/s 2026-02-21T09:26:21.2661187Z [101s] Generation 1 complete: 2026-02-21T09:26:21.2661401Z error=1 2026-02-21T09:26:21.2661556Z timeout=1 2026-02-21T09:26:21.2661692Z ok=101 2026-02-21T09:26:21.2661846Z min=0.1282 2026-02-21T09:26:21.2661973Z mid=0.3725 2026-02-21T09:26:21.2662092Z max=4.6295 2026-02-21T09:26:21.2662229Z best={'block_sizes': [32, 16, 16], 2026-02-21T09:26:21.2662457Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:26:21.2662694Z 'l2_groupings': [1], 2026-02-21T09:26:21.2662856Z 'load_eviction_policies': ['', ''], 2026-02-21T09:26:21.2663023Z 'loop_orders': [[0, 1]], 2026-02-21T09:26:21.2663203Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:26:21.2663363Z 'num_stages': 1, 2026-02-21T09:26:21.2663502Z 'num_warps': 4, 2026-02-21T09:26:21.2663632Z 'pid_type': 'flat', 2026-02-21T09:26:21.2663787Z 
'range_flattens': [None, True], 2026-02-21T09:26:21.2663965Z 'range_multi_buffers': [None, None], 2026-02-21T09:26:21.2664138Z 'range_num_stages': [0, 0], 2026-02-21T09:26:21.2664308Z 'range_unroll_factors': [0, 0], 2026-02-21T09:26:21.2664486Z 'range_warp_specializes': [], 2026-02-21T09:26:21.2664652Z 'waves_per_eu': 3} 2026-02-21T09:26:21.2748952Z [101s] Fitting surrogate: 203 points, 203 targets 2026-02-21T09:26:22.2445187Z [102s] Generation 2 starting: 99 neighbors, 5 active search path(s) 2026-02-21T09:27:00.5068496Z [141s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:27:01.4158841Z [142s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:27:01.8154235Z [142s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[4, 2], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:27:05.7736181Z [146s] 
Timeout after 30s compiling Config(block_sizes=[256, 2, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[4, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:27:05.7755194Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 0.8 configs/s 2026-02-21T09:27:11.9985837Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 16.5 configs/s 2026-02-21T09:27:13.4022692Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 595.6 2026-02-21T09:27:13.4023093Z configs/s 2026-02-21T09:27:13.8739128Z [154s] Generation 2 complete: 2026-02-21T09:27:13.8739291Z timeout=4 2026-02-21T09:27:13.8739368Z ok=101 2026-02-21T09:27:13.8739449Z min=0.0941 2026-02-21T09:27:13.8739525Z mid=0.3238 2026-02-21T09:27:13.8739603Z max=5.1746 2026-02-21T09:27:13.8739689Z best={'block_sizes': [32, 16, 16], 2026-02-21T09:27:13.8739829Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:27:13.8739961Z 'l2_groupings': [1], 2026-02-21T09:27:13.8740068Z 'load_eviction_policies': ['', ''], 2026-02-21T09:27:13.8740184Z 'loop_orders': [[0, 1]], 2026-02-21T09:27:13.8740286Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:27:13.8740387Z 'num_stages': 2, 2026-02-21T09:27:13.8740472Z 'num_warps': 2, 2026-02-21T09:27:13.8740560Z 'pid_type': 'flat', 2026-02-21T09:27:13.8740657Z 'range_flattens': [None, True], 2026-02-21T09:27:13.8740806Z 'range_multi_buffers': [None, False], 2026-02-21T09:27:13.8740924Z 'range_num_stages': [0, 0], 2026-02-21T09:27:13.8741043Z 'range_unroll_factors': [0, 0], 2026-02-21T09:27:13.8741150Z 'range_warp_specializes': [], 2026-02-21T09:27:13.8741254Z 'waves_per_eu': 3} 2026-02-21T09:27:13.8855383Z [154s] Fitting surrogate: 308 points, 308 targets 
2026-02-21T09:27:14.9986786Z [155s] Generation 3 starting: 105 neighbors, 5 active search path(s) 2026-02-21T09:27:57.7780549Z [198s] Timeout after 30s compiling Config(block_sizes=[512, 2, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[0, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T09:27:58.7149374Z [199s] Timeout after 30s compiling Config(block_sizes=[128, 2, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:27:59.5394419Z [200s] Timeout after 30s compiling Config(block_sizes=[256, 2, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:27:59.5421049Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 108/108 0.7 configs/s 2026-02-21T09:28:05.9706888Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 108/108 16.9 configs/s 2026-02-21T09:28:09.4005520Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.2 2026-02-21T09:28:09.4005774Z configs/s 2026-02-21T09:28:09.8516565Z [210s] Generation 3 complete: 2026-02-21T09:28:09.8516865Z 
error=1 2026-02-21T09:28:09.8517031Z timeout=3 2026-02-21T09:28:09.8517201Z ok=106 2026-02-21T09:28:09.8517354Z min=0.0908 2026-02-21T09:28:09.8517513Z mid=0.3135 2026-02-21T09:28:09.8517665Z max=6.8092 2026-02-21T09:28:09.8517843Z best={'block_sizes': [64, 16, 16], 2026-02-21T09:28:09.8518178Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:28:09.8518479Z 'l2_groupings': [1], 2026-02-21T09:28:09.8518697Z 'load_eviction_policies': ['', ''], 2026-02-21T09:28:09.8518944Z 'loop_orders': [[0, 1]], 2026-02-21T09:28:09.8519167Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:28:09.8519378Z 'num_stages': 2, 2026-02-21T09:28:09.8520167Z 'num_warps': 4, 2026-02-21T09:28:09.8520352Z 'pid_type': 'flat', 2026-02-21T09:28:09.8520583Z 'range_flattens': [None, True], 2026-02-21T09:28:09.8520826Z 'range_multi_buffers': [None, False], 2026-02-21T09:28:09.8521074Z 'range_num_stages': [0, 0], 2026-02-21T09:28:09.8521297Z 'range_unroll_factors': [0, 0], 2026-02-21T09:28:09.8521530Z 'range_warp_specializes': [], 2026-02-21T09:28:09.8521751Z 'waves_per_eu': 1} 2026-02-21T09:28:09.8900377Z [210s] Fitting surrogate: 418 points, 418 targets 2026-02-21T09:28:11.5540623Z [212s] Generation 4 starting: 90 neighbors, 5 active search path(s) 2026-02-21T09:28:30.9855630Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 0.9 configs/s 2026-02-21T09:28:36.3444242Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 17.8 configs/s 2026-02-21T09:28:41.3357802Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 189.6 2026-02-21T09:28:41.3358339Z configs/s 2026-02-21T09:28:41.8844030Z [242s] Generation 4 complete: 2026-02-21T09:28:41.8844334Z error=7 2026-02-21T09:28:41.8844497Z ok=89 2026-02-21T09:28:41.8844653Z min=0.0894 2026-02-21T09:28:41.8844807Z mid=0.2292 2026-02-21T09:28:41.8844976Z max=6.3389 2026-02-21T09:28:41.8845143Z best={'block_sizes': [128, 16, 16], 2026-02-21T09:28:41.8845424Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 
2026-02-21T09:28:41.8845693Z 'l2_groupings': [4], 2026-02-21T09:28:41.8845897Z 'load_eviction_policies': ['', ''], 2026-02-21T09:28:41.8846134Z 'loop_orders': [[0, 1]], 2026-02-21T09:28:41.8846340Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:28:41.8846544Z 'num_stages': 4, 2026-02-21T09:28:41.8846713Z 'num_warps': 2, 2026-02-21T09:28:41.8846889Z 'pid_type': 'flat', 2026-02-21T09:28:41.8847082Z 'range_flattens': [None, False], 2026-02-21T09:28:41.8847312Z 'range_multi_buffers': [None, None], 2026-02-21T09:28:41.8847539Z 'range_num_stages': [0, 1], 2026-02-21T09:28:41.8847755Z 'range_unroll_factors': [0, 1], 2026-02-21T09:28:41.8847996Z 'range_warp_specializes': [], 2026-02-21T09:28:41.8848574Z 'waves_per_eu': 2} 2026-02-21T09:28:41.9480663Z [242s] Fitting surrogate: 514 points, 514 targets 2026-02-21T09:28:42.9043826Z [243s] Generation 5 starting: 92 neighbors, 5 active search path(s) 2026-02-21T09:29:25.4196967Z [286s] Timeout after 30s compiling Config(block_sizes=[256, 2, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T09:29:25.4214669Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 0.3 configs/s 2026-02-21T09:29:31.1491564Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 16.5 configs/s 2026-02-21T09:29:37.2628781Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 156.9 2026-02-21T09:29:37.2629379Z configs/s 2026-02-21T09:29:37.8447399Z [298s] Generation 5 complete: 2026-02-21T09:29:37.8447661Z error=2 2026-02-21T09:29:37.8447801Z timeout=1 2026-02-21T09:29:37.8447932Z ok=94 2026-02-21T09:29:37.8448066Z min=0.0881 2026-02-21T09:29:37.8448197Z mid=0.1505 
2026-02-21T09:29:37.8448333Z max=6.4633 2026-02-21T09:29:37.8448476Z best={'block_sizes': [128, 16, 16], 2026-02-21T09:29:37.8448719Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:29:37.8448946Z 'l2_groupings': [16], 2026-02-21T09:29:37.8449132Z 'load_eviction_policies': ['', ''], 2026-02-21T09:29:37.8449334Z 'loop_orders': [[1, 0]], 2026-02-21T09:29:37.8449520Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:29:37.8449704Z 'num_sm_multiplier': 64, 2026-02-21T09:29:37.8449869Z 'num_stages': 2, 2026-02-21T09:29:37.8450459Z 'num_warps': 2, 2026-02-21T09:29:37.8450634Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:29:37.8450845Z 'range_flattens': [False, False], 2026-02-21T09:29:37.8451050Z 'range_multi_buffers': [False, True], 2026-02-21T09:29:37.8451249Z 'range_num_stages': [4, 1], 2026-02-21T09:29:37.8451430Z 'range_unroll_factors': [3, 0], 2026-02-21T09:29:37.8451623Z 'range_warp_specializes': [], 2026-02-21T09:29:37.8451799Z 'waves_per_eu': 2} 2026-02-21T09:29:37.9464489Z [298s] Fitting surrogate: 611 points, 611 targets 2026-02-21T09:29:38.9092982Z [299s] Generation 6 starting: 88 neighbors, 5 active search path(s) 2026-02-21T09:30:22.5002950Z [343s] Timeout after 30s compiling Config(block_sizes=[128, 2, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:30:23.3320766Z [344s] Timeout after 30s compiling Config(block_sizes=[128, 2, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], 
range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:30:23.3338406Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.2 configs/s 2026-02-21T09:30:28.6129440Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 17.2 configs/s 2026-02-21T09:30:35.4679027Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 155.7 2026-02-21T09:30:35.4679658Z configs/s 2026-02-21T09:30:35.9966714Z [356s] Generation 6 complete: 2026-02-21T09:30:35.9967064Z error=1 2026-02-21T09:30:35.9967278Z timeout=2 2026-02-21T09:30:35.9967509Z ok=90 2026-02-21T09:30:35.9967711Z min=0.0874 2026-02-21T09:30:35.9967919Z mid=0.1325 2026-02-21T09:30:35.9968114Z max=2.7252 2026-02-21T09:30:35.9968350Z best={'block_sizes': [128, 16, 16], 2026-02-21T09:30:35.9968727Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:30:35.9969093Z 'l2_groupings': [16], 2026-02-21T09:30:35.9969385Z 'load_eviction_policies': ['', ''], 2026-02-21T09:30:35.9969697Z 'loop_orders': [[1, 0]], 2026-02-21T09:30:35.9969982Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:30:35.9970264Z 'num_sm_multiplier': 64, 2026-02-21T09:30:35.9970532Z 'num_stages': 3, 2026-02-21T09:30:35.9970770Z 'num_warps': 2, 2026-02-21T09:30:35.9971040Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:30:35.9971367Z 'range_flattens': [False, False], 2026-02-21T09:30:35.9971687Z 'range_multi_buffers': [False, True], 2026-02-21T09:30:35.9972424Z 'range_num_stages': [4, 1], 2026-02-21T09:30:35.9972718Z 'range_unroll_factors': [3, 0], 2026-02-21T09:30:35.9973024Z 'range_warp_specializes': [], 2026-02-21T09:30:35.9973289Z 'waves_per_eu': 2} 2026-02-21T09:30:36.0802675Z [356s] Fitting surrogate: 704 points, 704 targets 2026-02-21T09:30:36.8882849Z [357s] Generation 7 starting: 72 neighbors, 4 active search path(s) 2026-02-21T09:31:16.3117502Z [397s] Timeout after 30s compiling Config(block_sizes=[128, 2, 64], 
indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:31:16.3141321Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.2 configs/s 2026-02-21T09:31:20.7460155Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.9 configs/s 2026-02-21T09:31:24.1190985Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 274.4 2026-02-21T09:31:24.1191667Z configs/s 2026-02-21T09:31:24.5464352Z [405s] Generation 7 complete: 2026-02-21T09:31:24.5464608Z timeout=1 2026-02-21T09:31:24.5464759Z ok=75 2026-02-21T09:31:24.5464917Z min=0.0846 2026-02-21T09:31:24.5465068Z mid=0.1910 2026-02-21T09:31:24.5465225Z max=5.0350 2026-02-21T09:31:24.5465413Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:31:24.5465688Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:31:24.5465953Z 'l2_groupings': [4], 2026-02-21T09:31:24.5466164Z 'load_eviction_policies': ['', ''], 2026-02-21T09:31:24.5466398Z 'loop_orders': [[0, 1]], 2026-02-21T09:31:24.5466607Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:31:24.5466810Z 'num_stages': 4, 2026-02-21T09:31:24.5467028Z 'num_warps': 2, 2026-02-21T09:31:24.5467209Z 'pid_type': 'flat', 2026-02-21T09:31:24.5467434Z 'range_flattens': [None, False], 2026-02-21T09:31:24.5467674Z 'range_multi_buffers': [None, True], 2026-02-21T09:31:24.5468216Z 'range_num_stages': [0, 1], 2026-02-21T09:31:24.5468428Z 'range_unroll_factors': [0, 1], 2026-02-21T09:31:24.5468645Z 'range_warp_specializes': [], 2026-02-21T09:31:24.5468855Z 'waves_per_eu': 1} 2026-02-21T09:31:24.5893720Z [405s] Fitting surrogate: 780 points, 780 targets 2026-02-21T09:31:25.3269593Z [406s] Generation 8 starting: 68 neighbors, 4 active search path(s) 
2026-02-21T09:31:37.2207081Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 12.8 configs/s 2026-02-21T09:31:41.5274647Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.6 configs/s 2026-02-21T09:31:44.9429834Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 274.4 2026-02-21T09:31:44.9432404Z configs/s 2026-02-21T09:31:45.4239044Z [426s] Generation 8 complete: 2026-02-21T09:31:45.4239450Z ok=72 2026-02-21T09:31:45.4239700Z min=0.0777 2026-02-21T09:31:45.4239919Z mid=0.1677 2026-02-21T09:31:45.4240149Z max=3.8257 2026-02-21T09:31:45.4240379Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:31:45.4240772Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T09:31:45.4241144Z 'l2_groupings': [4], 2026-02-21T09:31:45.4241427Z 'load_eviction_policies': ['', ''], 2026-02-21T09:31:45.4241751Z 'loop_orders': [[0, 1]], 2026-02-21T09:31:45.4242038Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:31:45.4242327Z 'num_sm_multiplier': 64, 2026-02-21T09:31:45.4242668Z 'num_stages': 4, 2026-02-21T09:31:45.4242910Z 'num_warps': 2, 2026-02-21T09:31:45.4243880Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:31:45.4244325Z 'range_flattens': [None, False], 2026-02-21T09:31:45.4244635Z 'range_multi_buffers': [False, True], 2026-02-21T09:31:45.4244964Z 'range_num_stages': [2, 1], 2026-02-21T09:31:45.4245256Z 'range_unroll_factors': [3, 1], 2026-02-21T09:31:45.4245971Z 'range_warp_specializes': [], 2026-02-21T09:31:45.4246209Z 'waves_per_eu': 1} 2026-02-21T09:31:45.4667001Z [426s] Fitting surrogate: 852 points, 852 targets 2026-02-21T09:31:46.1222395Z [426s] Generation 9 starting: 57 neighbors, 3 active search path(s) 2026-02-21T09:32:02.1595389Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.5 configs/s 2026-02-21T09:32:05.8325064Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.7 configs/s 2026-02-21T09:32:09.3163146Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 
1000/1000 268.5 2026-02-21T09:32:09.3163750Z configs/s 2026-02-21T09:32:09.8358648Z [450s] Generation 9 complete: 2026-02-21T09:32:09.8358997Z ok=60 2026-02-21T09:32:09.8359198Z min=0.0778 2026-02-21T09:32:09.8359400Z mid=0.1385 2026-02-21T09:32:09.8359588Z max=1.6785 2026-02-21T09:32:09.8359796Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:32:09.8360171Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T09:32:09.8360511Z 'l2_groupings': [4], 2026-02-21T09:32:09.8360754Z 'load_eviction_policies': ['', ''], 2026-02-21T09:32:09.8361061Z 'loop_orders': [[0, 1]], 2026-02-21T09:32:09.8361336Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:32:09.8361596Z 'num_sm_multiplier': 64, 2026-02-21T09:32:09.8361822Z 'num_stages': 4, 2026-02-21T09:32:09.8362035Z 'num_warps': 2, 2026-02-21T09:32:09.8362277Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:32:09.8362672Z 'range_flattens': [None, False], 2026-02-21T09:32:09.8362965Z 'range_multi_buffers': [False, True], 2026-02-21T09:32:09.8363234Z 'range_num_stages': [2, 1], 2026-02-21T09:32:09.8363479Z 'range_unroll_factors': [3, 1], 2026-02-21T09:32:09.8363739Z 'range_warp_specializes': [], 2026-02-21T09:32:09.8363979Z 'waves_per_eu': 1} 2026-02-21T09:32:09.8805534Z [450s] Fitting surrogate: 912 points, 912 targets 2026-02-21T09:32:10.4475242Z [451s] Generation 10 starting: 50 neighbors, 3 active search path(s) 2026-02-21T09:32:20.2747787Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 1.7 configs/s 2026-02-21T09:32:23.4046138Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 51/51 17.1 configs/s 2026-02-21T09:32:27.2943633Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 242.9 2026-02-21T09:32:27.2944236Z configs/s 2026-02-21T09:32:27.8003050Z [468s] Generation 10 complete: 2026-02-21T09:32:27.8003386Z error=1 2026-02-21T09:32:27.8003598Z ok=52 2026-02-21T09:32:27.8003810Z min=0.0759 2026-02-21T09:32:27.8004020Z mid=0.0938 2026-02-21T09:32:27.8004223Z max=0.7182 
2026-02-21T09:32:27.8004450Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:32:27.8004832Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:32:27.8005200Z 'l2_groupings': [4], 2026-02-21T09:32:27.8005480Z 'load_eviction_policies': ['', ''], 2026-02-21T09:32:27.8005792Z 'loop_orders': [[0, 1]], 2026-02-21T09:32:27.8006119Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:32:27.8006404Z 'num_sm_multiplier': 32, 2026-02-21T09:32:27.8006671Z 'num_stages': 4, 2026-02-21T09:32:27.8006933Z 'num_warps': 2, 2026-02-21T09:32:27.8007197Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:32:27.8007528Z 'range_flattens': [None, False], 2026-02-21T09:32:27.8007838Z 'range_multi_buffers': [False, True], 2026-02-21T09:32:27.8008155Z 'range_num_stages': [2, 1], 2026-02-21T09:32:27.8008441Z 'range_unroll_factors': [3, 1], 2026-02-21T09:32:27.8008746Z 'range_warp_specializes': [], 2026-02-21T09:32:27.8009027Z 'waves_per_eu': 1} 2026-02-21T09:32:27.8610752Z [468s] Fitting surrogate: 965 points, 965 targets 2026-02-21T09:32:28.4748967Z [469s] Generation 11 starting: 55 neighbors, 3 active search path(s) 2026-02-21T09:32:41.8937392Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 1.5 configs/s 2026-02-21T09:32:45.2110714Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 55/55 17.3 configs/s 2026-02-21T09:32:49.8527690Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 205.7 2026-02-21T09:32:49.8528291Z configs/s 2026-02-21T09:32:50.3904708Z [491s] Generation 11 complete: 2026-02-21T09:32:50.3905032Z error=2 2026-02-21T09:32:50.3905228Z ok=56 2026-02-21T09:32:50.3905425Z min=0.0755 2026-02-21T09:32:50.3905607Z mid=0.0915 2026-02-21T09:32:50.3905782Z max=0.6898 2026-02-21T09:32:50.3905980Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:32:50.3909251Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:32:50.3909638Z 'l2_groupings': [4], 2026-02-21T09:32:50.3909857Z 'load_eviction_policies': ['', ''], 
2026-02-21T09:32:50.3910088Z 'loop_orders': [[0, 1]], 2026-02-21T09:32:50.3910292Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:32:50.3910518Z 'num_sm_multiplier': 32, 2026-02-21T09:32:50.3910708Z 'num_stages': 4, 2026-02-21T09:32:50.3910884Z 'num_warps': 2, 2026-02-21T09:32:50.3911146Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:32:50.3911394Z 'range_flattens': [None, False], 2026-02-21T09:32:50.3911614Z 'range_multi_buffers': [False, True], 2026-02-21T09:32:50.3911889Z 'range_num_stages': [1, 1], 2026-02-21T09:32:50.3912104Z 'range_unroll_factors': [3, 1], 2026-02-21T09:32:50.3912320Z 'range_warp_specializes': [], 2026-02-21T09:32:50.3912526Z 'waves_per_eu': 1} 2026-02-21T09:32:50.4675040Z [491s] Fitting surrogate: 1023 points, 1023 targets 2026-02-21T09:32:51.0720248Z [491s] Generation 12 starting: 54 neighbors, 3 active search path(s) 2026-02-21T09:33:24.9389440Z [525s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[1, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:33:28.3527471Z [529s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:33:28.5480732Z [529s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], 
loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:33:28.8149004Z [529s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:33:29.0894460Z [529s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:33:29.6155709Z [530s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:33:29.7676109Z [530s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], 
range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:33:29.7699059Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 0.4 configs/s 2026-02-21T09:33:32.4143331Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 54/54 20.8 configs/s 2026-02-21T09:33:37.0233871Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 207.0 2026-02-21T09:33:37.0236174Z configs/s 2026-02-21T09:33:37.5535534Z [538s] Generation 12 complete: 2026-02-21T09:33:37.5535911Z error=3 2026-02-21T09:33:37.5536141Z timeout=7 2026-02-21T09:33:37.5536346Z ok=47 2026-02-21T09:33:37.5536553Z min=0.0758 2026-02-21T09:33:37.5536756Z mid=0.0864 2026-02-21T09:33:37.5536956Z max=0.2752 2026-02-21T09:33:37.5537173Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:33:37.5537557Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:33:37.5537926Z 'l2_groupings': [4], 2026-02-21T09:33:37.5538201Z 'load_eviction_policies': ['', ''], 2026-02-21T09:33:37.5538516Z 'loop_orders': [[0, 1]], 2026-02-21T09:33:37.5538802Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:33:37.5539088Z 'num_sm_multiplier': 32, 2026-02-21T09:33:37.5539352Z 'num_stages': 4, 2026-02-21T09:33:37.5539583Z 'num_warps': 2, 2026-02-21T09:33:37.5539845Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:33:37.5540205Z 'range_flattens': [None, False], 2026-02-21T09:33:37.5540510Z 'range_multi_buffers': [False, True], 2026-02-21T09:33:37.5540824Z 'range_num_stages': [1, 1], 2026-02-21T09:33:37.5541507Z 'range_unroll_factors': [3, 1], 2026-02-21T09:33:37.5541813Z 'range_warp_specializes': [], 2026-02-21T09:33:37.5542087Z 'waves_per_eu': 1} 2026-02-21T09:33:37.6279722Z [538s] Fitting surrogate: 1080 points, 1080 targets 2026-02-21T09:33:37.9924709Z [538s] Generation 13 starting: 23 neighbors, 2 active search path(s) 2026-02-21T09:34:11.1392121Z [571s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], 
load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:34:11.1410447Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 0.4 configs/s 2026-02-21T09:34:12.2913014Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 23/23 21.0 configs/s 2026-02-21T09:34:13.4451232Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 717.5 2026-02-21T09:34:13.4451781Z configs/s 2026-02-21T09:34:13.7982102Z [574s] Generation 13 complete: 2026-02-21T09:34:13.7982451Z error=3 2026-02-21T09:34:13.7982660Z timeout=1 2026-02-21T09:34:13.7982893Z ok=22 2026-02-21T09:34:13.7983089Z min=0.0759 2026-02-21T09:34:13.7983297Z mid=0.1191 2026-02-21T09:34:13.7983493Z max=0.9180 2026-02-21T09:34:13.7983721Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:34:13.7984089Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:34:13.7984451Z 'l2_groupings': [4], 2026-02-21T09:34:13.7984731Z 'load_eviction_policies': ['', ''], 2026-02-21T09:34:13.7985044Z 'loop_orders': [[0, 1]], 2026-02-21T09:34:13.7985358Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:34:13.7986115Z 'num_sm_multiplier': 32, 2026-02-21T09:34:13.7986389Z 'num_stages': 4, 2026-02-21T09:34:13.7986645Z 'num_warps': 2, 2026-02-21T09:34:13.7986914Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:34:13.7987246Z 'range_flattens': [None, False], 2026-02-21T09:34:13.7998200Z 'range_multi_buffers': [False, True], 2026-02-21T09:34:13.7998416Z 'range_num_stages': [1, 1], 2026-02-21T09:34:13.7998583Z 'range_unroll_factors': [3, 1], 2026-02-21T09:34:13.7998756Z 'range_warp_specializes': [], 2026-02-21T09:34:13.7998918Z 'waves_per_eu': 1} 2026-02-21T09:34:13.8129179Z [574s] Fitting surrogate: 1106 points, 1106 targets 2026-02-21T09:34:14.0857200Z [574s] Generation 14 
starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:34:45.2173984Z [605s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:34:45.2195481Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.4 configs/s 2026-02-21T09:34:46.1777167Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 18.7 configs/s 2026-02-21T09:34:47.7082478Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 566.2 2026-02-21T09:34:47.7083140Z configs/s 2026-02-21T09:34:48.0492172Z [608s] Generation 14 complete: 2026-02-21T09:34:48.0492304Z timeout=1 2026-02-21T09:34:48.0492379Z ok=18 2026-02-21T09:34:48.0492461Z min=0.0758 2026-02-21T09:34:48.0492538Z mid=0.0792 2026-02-21T09:34:48.0492616Z max=0.2255 2026-02-21T09:34:48.0492707Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:34:48.0492852Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:34:48.0493012Z 'l2_groupings': [4], 2026-02-21T09:34:48.0493126Z 'load_eviction_policies': ['', ''], 2026-02-21T09:34:48.0493557Z 'loop_orders': [[0, 1]], 2026-02-21T09:34:48.0493672Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:34:48.0493778Z 'num_sm_multiplier': 32, 2026-02-21T09:34:48.0493874Z 'num_stages': 4, 2026-02-21T09:34:48.0493962Z 'num_warps': 2, 2026-02-21T09:34:48.0494664Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:34:48.0494789Z 'range_flattens': [None, False], 2026-02-21T09:34:48.0494903Z 'range_multi_buffers': [False, True], 2026-02-21T09:34:48.0495018Z 'range_num_stages': [1, 1], 2026-02-21T09:34:48.0495121Z 'range_unroll_factors': [3, 1], 2026-02-21T09:34:48.0495231Z 'range_warp_specializes': [], 
2026-02-21T09:34:48.0495336Z 'waves_per_eu': 1} 2026-02-21T09:34:48.0711041Z [608s] Fitting surrogate: 1125 points, 1125 targets 2026-02-21T09:34:48.3637932Z [609s] Generation 15 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:35:19.9193208Z [640s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:35:20.0583118Z [640s] Timeout after 30s compiling Config(block_sizes=[256, 16, 32], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:35:20.0595605Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 0.5 configs/s 2026-02-21T09:35:21.0912238Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 19.4 configs/s 2026-02-21T09:35:22.7660122Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 521.7 2026-02-21T09:35:22.7660724Z configs/s 2026-02-21T09:35:23.1605332Z [643s] Generation 15 complete: 2026-02-21T09:35:23.1605562Z timeout=2 2026-02-21T09:35:23.1605692Z ok=19 2026-02-21T09:35:23.1605826Z min=0.0759 2026-02-21T09:35:23.1605955Z mid=0.0876 2026-02-21T09:35:23.1606090Z max=0.2290 2026-02-21T09:35:23.1606234Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:35:23.1606479Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:35:23.1606710Z 'l2_groupings': 
[4], 2026-02-21T09:35:23.1606891Z 'load_eviction_policies': ['', ''], 2026-02-21T09:35:23.1607093Z 'loop_orders': [[0, 1]], 2026-02-21T09:35:23.1607271Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:35:23.1607451Z 'num_sm_multiplier': 32, 2026-02-21T09:35:23.1607656Z 'num_stages': 4, 2026-02-21T09:35:23.1607805Z 'num_warps': 2, 2026-02-21T09:35:23.1607976Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:35:23.1608485Z 'range_flattens': [None, False], 2026-02-21T09:35:23.1608678Z 'range_multi_buffers': [False, True], 2026-02-21T09:35:23.1608882Z 'range_num_stages': [1, 1], 2026-02-21T09:35:23.1609067Z 'range_unroll_factors': [3, 1], 2026-02-21T09:35:23.1609270Z 'range_warp_specializes': [], 2026-02-21T09:35:23.1609441Z 'waves_per_eu': 1} 2026-02-21T09:35:23.1863949Z [643s] Fitting surrogate: 1146 points, 1146 targets 2026-02-21T09:35:23.4821953Z [644s] Generation 16 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:35:55.5368613Z [676s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:35:55.6533302Z [676s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:35:55.8984805Z [676s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], 
indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:35:55.9006618Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 0.6 configs/s 2026-02-21T09:35:56.8190738Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 22.1 configs/s 2026-02-21T09:35:58.7583021Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 668.7 2026-02-21T09:35:58.7583602Z configs/s 2026-02-21T09:35:59.1407684Z [679s] Generation 16 complete: 2026-02-21T09:35:59.1408030Z error=1 2026-02-21T09:35:59.1408244Z timeout=3 2026-02-21T09:35:59.1408447Z ok=17 2026-02-21T09:35:59.1408648Z min=0.0759 2026-02-21T09:35:59.1408880Z mid=0.0876 2026-02-21T09:35:59.1409083Z max=0.3048 2026-02-21T09:35:59.1409322Z best={'block_sizes': [256, 16, 16], 2026-02-21T09:35:59.1409701Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T09:35:59.1410113Z 'l2_groupings': [4], 2026-02-21T09:35:59.1410394Z 'load_eviction_policies': ['', ''], 2026-02-21T09:35:59.1410708Z 'loop_orders': [[0, 1]], 2026-02-21T09:35:59.1411029Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:35:59.1411308Z 'num_sm_multiplier': 32, 2026-02-21T09:35:59.1411574Z 'num_stages': 4, 2026-02-21T09:35:59.1411799Z 'num_warps': 2, 2026-02-21T09:35:59.1412061Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:35:59.1412388Z 'range_flattens': [None, False], 2026-02-21T09:35:59.1412693Z 'range_multi_buffers': [False, True], 2026-02-21T09:35:59.1413001Z 'range_num_stages': [1, 1], 2026-02-21T09:35:59.1413286Z 'range_unroll_factors': [3, 1], 2026-02-21T09:35:59.1413583Z 'range_warp_specializes': [], 2026-02-21T09:35:59.1413851Z 'waves_per_eu': 1} 
2026-02-21T09:35:59.1576237Z [679s] Fitting surrogate: 1167 points, 1167 targets 2026-02-21T09:35:59.2870808Z [679s] Autotuning complete in 680.0s after searching 1100 configs. 2026-02-21T09:35:59.2871353Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:35:59.2873413Z @helion.kernel(config=helion.Config(block_sizes=[256, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:35:59.2875673Z 2026-02-21T09:35:59.2876131Z [679s] Code of selected kernel: /tmp/torchinductor_root/ll/cll53dqlm5eq55uqbwntdste57uy2gcyxcujca5g2h2jgsbmdcpl.py 2026-02-21T09:35:59.3034585Z from __future__ import annotations 2026-02-21T09:35:59.3034734Z 2026-02-21T09:35:59.3034790Z import torch 2026-02-21T09:35:59.3034919Z import helion 2026-02-21T09:35:59.3035044Z import triton 2026-02-21T09:35:59.3035183Z import triton.language as tl 2026-02-21T09:35:59.3035430Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:35:59.3035620Z 2026-02-21T09:35:59.3035691Z _BLOCK_SIZE_1 = tl.constexpr(16) 2026-02-21T09:35:59.3035868Z _BLOCK_SIZE_2 = tl.constexpr(16) 2026-02-21T09:35:59.3036038Z _BLOCK_SIZE_0 = tl.constexpr(256) 2026-02-21T09:35:59.3036152Z 2026-02-21T09:35:59.3036212Z @triton.jit 2026-02-21T09:35:59.3036500Z def _helion_matmul_bf16_int4(A, B, C, _NUM_SM: tl.constexpr, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:35:59.3036900Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:35:59.3037227Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:35:59.3037494Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T09:35:59.3037727Z total_pids = tl.cdiv(16, _BLOCK_SIZE_1) * tl.cdiv(7168, _BLOCK_SIZE_2) 2026-02-21T09:35:59.3038195Z for virtual_pid in tl.range(tl.program_id(0), total_pids, _NUM_SM * 32, loop_unroll_factor=3, num_stages=1, disallow_acc_multi_buffer=True): 2026-02-21T09:35:59.3038841Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:35:59.3039093Z num_pid_m = tl.cdiv(16, _BLOCK_SIZE_1) 2026-02-21T09:35:59.3039293Z num_pid_n = tl.cdiv(7168, _BLOCK_SIZE_2) 2026-02-21T09:35:59.3039485Z inner_2d_pid = virtual_pid 2026-02-21T09:35:59.3039678Z num_pid_in_group = 4 * num_pid_n 2026-02-21T09:35:59.3039879Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:35:59.3040079Z first_pid_m = group_id * 4 2026-02-21T09:35:59.3040280Z group_size_m = min(num_pid_m - first_pid_m, 4) 2026-02-21T09:35:59.3040540Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:35:59.3040826Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T09:35:59.3041050Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T09:35:59.3041287Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:35:59.3041533Z offset_2 = pid_1 * _BLOCK_SIZE_2 2026-02-21T09:35:59.3041759Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:35:59.3042068Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:35:59.3042370Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:35:59.3042761Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:35:59.3043177Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:35:59.3043576Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:35:59.3043853Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:35:59.3044256Z for offset_3 in tl.range(0, 4096, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=1, disallow_acc_multi_buffer=False, flatten=False): 2026-02-21T09:35:59.3044698Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:35:59.3044936Z acc_copy = acc 2026-02-21T09:35:59.3045162Z acc_copy_0 = acc_copy 2026-02-21T09:35:59.3045379Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:35:59.3045611Z mul = 2 * offset_3 2026-02-21T09:35:59.3045870Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:35:59.3046158Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:35:59.3046405Z load = tl.load(A + (indices_1[:, None] * 8192 + iota[None, :] * 1), None) 2026-02-21T09:35:59.3046739Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:35:59.3047036Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:35:59.3047274Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:35:59.3047460Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:35:59.3047686Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:35:59.3047978Z b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None) 2026-02-21T09:35:59.3048267Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:35:59.3048487Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:35:59.3048633Z v_2 = b_tile << v_1 2026-02-21T09:35:59.3048766Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:35:59.3048903Z v_4 = v_2 >> v_3 2026-02-21T09:35:59.3049097Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:35:59.3049313Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:35:59.3049455Z v_6 = b_tile >> v_5 2026-02-21T09:35:59.3049675Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 
2026-02-21T09:35:59.3049875Z stack_idx = tl.arange(0, 2) 2026-02-21T09:35:59.3050042Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:35:59.3050216Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:35:59.3050372Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:35:59.3050538Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:35:59.3050699Z mask_0 = broadcast_idx == 0 2026-02-21T09:35:59.3050886Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:35:59.3051081Z mask_1 = broadcast_idx == 1 2026-02-21T09:35:59.3051257Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:35:59.3051489Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:35:59.3051716Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:35:59.3051954Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:35:59.3052157Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:35:59.3052355Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:35:59.3052577Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:35:59.3052798Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:35:59.3052982Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:35:59.3053172Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:35:59.3053407Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:35:59.3053653Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:35:59.3053810Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:35:59.3053969Z acc = acc_copy_0 + sum_1 2026-02-21T09:35:59.3054157Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:35:59.3054348Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:35:59.3054699Z tl.store(tl.make_block_ptr(C, [16, 7168], [7168, 1], [offset_1, offset_2], [_BLOCK_SIZE_1, _BLOCK_SIZE_2], [1, 0]), v_10, 
boundary_check=[0, 1]) 2026-02-21T09:35:59.3054969Z 2026-02-21T09:35:59.3055075Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:35:59.3055267Z """ 2026-02-21T09:35:59.3055399Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:35:59.3055525Z 2026-02-21T09:35:59.3055602Z This kernel performs matrix multiplication where: 2026-02-21T09:35:59.3055798Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:35:59.3055964Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:35:59.3056132Z (two 4-bit values packed into each int8) 2026-02-21T09:35:59.3056220Z 2026-02-21T09:35:59.3056253Z Args: 2026-02-21T09:35:59.3056371Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:35:59.3056549Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:35:59.3056670Z 2026-02-21T09:35:59.3056702Z Returns: 2026-02-21T09:35:59.3056817Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 
2026-02-21T09:35:59.3056953Z """ 2026-02-21T09:35:59.3057037Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:35:59.3057150Z M, K = A.shape 2026-02-21T09:35:59.3057247Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:35:59.3057355Z _, N = B.shape 2026-02-21T09:35:59.3057501Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:35:59.3057702Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:35:59.3057880Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:35:59.3058040Z _NUM_SM = helion.runtime.get_num_sm(A.device) 2026-02-21T09:35:59.3058292Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:35:59.3058561Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:35:59.3058818Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:35:59.3058996Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:35:59.3059104Z _BLOCK_SIZE_0 = 256 2026-02-21T09:35:59.3059230Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:35:59.3059410Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:35:59.3059579Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:35:59.3059705Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:35:59.3059846Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:35:59.3060039Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:35:59.3060208Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T09:35:59.3060341Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:35:59.3060649Z _launcher(_helion_matmul_bf16_int4, (_NUM_SM * 32,), A, B, C, _NUM_SM, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=2, num_stages=4, waves_per_eu=1, matrix_instr_nonkdim=16) 2026-02-21T09:35:59.3060935Z # src[int4_gemm.py:93]: return C 2026-02-21T09:35:59.3061044Z return C 2026-02-21T09:36:00.2613509Z WARNING:tritonbench.utils.triton_op:Completed input ID 10: 2026-02-21T09:36:00.2613958Z x_val 2026-02-21T09:36:00.2614176Z ------------------- 2026-02-21T09:36:00.2614420Z (16, 1, 7168, 8192) 2026-02-21T09:36:00.2614562Z 2026-02-21T09:36:00.2635038Z 40%|████ | 4/10 [33:21<53:27, 534.58s/it]WARNING:tritonbench.utils.triton_op:Running input ID 14: 2026-02-21T09:36:00.2635470Z x_val 2026-02-21T09:36:00.2635644Z ------------------- 2026-02-21T09:36:00.2635838Z (64, 1, 7168, 8192) 2026-02-21T09:36:00.2637341Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:36:01.3270280Z INFO:tritonbench.utils.triton_op:Took 4.45ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:36:03.7131543Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T09:36:03.7149944Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:36:03.7150081Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:36:03.7150207Z 'dtype': 'torch.bfloat16', 2026-02-21T09:36:03.7150327Z 'shape': (64, 1, 8192), 2026-02-21T09:36:03.7150439Z 'stride': (8192, 8192, 1)}, 2026-02-21T09:36:03.7150556Z { 'device': 'cuda:0', 2026-02-21T09:36:03.7150668Z 'dtype': 'torch.int32', 2026-02-21T09:36:03.7150777Z 'shape': (8192, 7168), 2026-02-21T09:36:03.7150889Z 'stride': (7168, 1)}), 2026-02-21T09:36:03.7150993Z 'kwargs': {}} 2026-02-21T09:36:03.7186350Z INFO:tritonbench.utils.triton_op:Took 3.65ms to get benchmark function for helion_int4_gemm_tritonbench 
2026-02-21T09:36:03.8956365Z [0s] Autotune random seed: 2138032649 2026-02-21T09:36:03.9162935Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:36:40.2709481Z [36s] Timeout after 30s compiling Config(block_sizes=[512, 2, 64], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:36:41.3564197Z [37s] Timeout after 30s compiling Config(block_sizes=[2048, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:36:43.5023387Z [39s] Timeout after 30s compiling Config(block_sizes=[4096, 32, 4], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:36:44.4451176Z [40s] Timeout after 30s compiling Config(block_sizes=[1024, 2, 32], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, 
pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[4, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:36:48.2928875Z [44s] Timeout after 30s compiling Config(block_sizes=[128, 1, 512], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:36:48.2948935Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T09:36:59.0164445Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.3 configs/s 2026-02-21T09:36:59.0173456Z [55s] Adaptive compile timeout: 30s (90% percentile=13.4s, bounds=[30.0s, 30s]) 2026-02-21T09:36:59.1040960Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 754/754 - configs/s 2026-02-21T09:36:59.8466852Z [55s] Initial random population of 100, 5 starting points: 2026-02-21T09:36:59.8469666Z error=2 2026-02-21T09:36:59.8469941Z timeout=5 2026-02-21T09:36:59.8470160Z ok=93 2026-02-21T09:36:59.8470362Z min=0.2463 2026-02-21T09:36:59.8473466Z mid=3.2506 2026-02-21T09:36:59.8473567Z max=153.5757 2026-02-21T09:36:59.8473671Z best={'block_sizes': [16, 16, 16], 2026-02-21T09:36:59.8473814Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:36:59.8473944Z 'l2_groupings': [1], 2026-02-21T09:36:59.8474048Z 'load_eviction_policies': ['', ''], 2026-02-21T09:36:59.8474164Z 'loop_orders': [[0, 1]], 2026-02-21T09:36:59.8474268Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:36:59.8474368Z 'num_stages': 1, 2026-02-21T09:36:59.8474455Z 'num_warps': 4, 2026-02-21T09:36:59.8474540Z 'pid_type': 'flat', 2026-02-21T09:36:59.8474648Z 'range_flattens': [None, None], 
2026-02-21T09:36:59.8474770Z 'range_multi_buffers': [None, None], 2026-02-21T09:36:59.8474883Z 'range_num_stages': [0, 0], 2026-02-21T09:36:59.8474995Z 'range_unroll_factors': [0, 0], 2026-02-21T09:36:59.8475103Z 'range_warp_specializes': [], 2026-02-21T09:36:59.8475209Z 'waves_per_eu': 1} 2026-02-21T09:36:59.8489477Z [55s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:37:00.8420503Z [56s] Generation 1 starting: 100 neighbors, 5 active search path(s) 2026-02-21T09:37:18.3895832Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 3.4 configs/s 2026-02-21T09:37:21.5539237Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:21.5541766Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:37:21.5543258Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:37:21.5544118Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:37:21.5544901Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [16, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:37:21.5545599Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:37:21.5546240Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:21.5546457Z #smem = #ttg.shared_memory 2026-02-21T09:37:21.5546712Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:21.5547189Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : 
i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:21.5547593Z %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32, #mma> 2026-02-21T09:37:21.5547760Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:21.5547901Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:37:21.5548015Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:21.5548130Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:37:21.5548247Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:21.5548356Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:21.5548498Z %cst_0 = arith.constant dense<0> : tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5548646Z %c4092_i32 = arith.constant 4092 : i32 2026-02-21T09:37:21.5548764Z %c12_i32 = arith.constant 12 : i32 2026-02-21T09:37:21.5548872Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:37:21.5549056Z %cst_1 = arith.constant dense<8184> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5549357Z %cst_2 = arith.constant dense<4092> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5549762Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:21.5549979Z %cst_4 = arith.constant dense<7168> : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5550232Z %cst_5 = arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5550446Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.5550618Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.5550783Z %cst_8 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:21.5550928Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:21.5551040Z %1 = arith.divsi %0, %c16_i32 : i32 2026-02-21T09:37:21.5551157Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:37:21.5551271Z %3 = arith.subi %c224_i32, %2 : i32 
2026-02-21T09:37:21.5551385Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:37:21.5551498Z %5 = arith.remsi %0, %c16_i32 : i32 2026-02-21T09:37:21.5551608Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:37:21.5551718Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:37:21.5551820Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:37:21.5551929Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:37:21.5552162Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5552471Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:21.5552758Z %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5553004Z %13 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:21.5553289Z %14 = arith.addi %12, %10 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5553537Z %15 = arith.addi %13, %11 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:21.5553697Z %16 = arith.muli %8, %c16_i32 : i32 2026-02-21T09:37:21.5553889Z %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:21.5554156Z %18 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:21.5554397Z %19 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:21.5554605Z %20 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:21.5554811Z %21 = arith.addi %19, %17 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:21.5555019Z %22 = arith.addi %20, %18 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:37:21.5555292Z %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5555601Z %24 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5555899Z %25 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:21.5556146Z %26 = arith.muli %25, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:21.5556334Z %27 = tt.broadcast %26 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5556546Z %28 = tt.splat %arg0 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:37:21.5556890Z %29 = tt.expand_dims %14 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5557313Z %30 = tt.broadcast %29 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5557673Z %31 = tt.splat %arg1 : !tt.ptr -> tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5557976Z %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:21.5558378Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:21.5558772Z %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.5559025Z %35 = arith.cmpi eq, %34, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.5559220Z %36 = tt.broadcast %35 : 
tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:37:21.5559417Z %37 = arith.cmpi eq, %34, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.5559602Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:37:21.5559866Z %39 = scf.for %arg3 = %c0_i32 to %c4092_i32 step %c12_i32 iter_args(%arg4 = %cst) -> (tensor<16x32xf32, #mma>) : i32 { 2026-02-21T09:37:21.5560172Z %79 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5560475Z %80 = arith.addi %79, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5560687Z %81 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:37:21.5560857Z %82 = tt.splat %81 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5561106Z %83 = arith.addi %82, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5561373Z %84 = tt.expand_dims %83 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:37:21.5561646Z %85 = tt.broadcast %84 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5561836Z %86 = arith.addi %27, %85 : tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5562029Z %87 = tt.addptr %28, %86 : tensor<16x8x!tt.ptr, #blocked1>, tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5562232Z %88 = tt.load %87 : tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:37:21.5562444Z %89 = ttg.local_alloc %88 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:21.5562837Z %90 = ttg.local_load %89 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5563240Z %91 = arith.extf %90 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5563691Z %92 = tt.expand_dims %80 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5564040Z %93 = arith.muli %92, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5564342Z %94 = tt.broadcast %93 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5564636Z %95 = arith.addi %94, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5564938Z %96 = tt.addptr %31, %95 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5565242Z %97 = tt.load %96 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5565512Z %98 = arith.shli %97, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5565740Z %99 = arith.shrsi %98, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5565971Z %100 = arith.shrsi %97, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5566255Z %101 = tt.expand_dims %99 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5566587Z %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5566868Z %103 = tt.broadcast %101 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5567108Z %104 = arith.select %36, %103, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5567343Z %105 = tt.broadcast %102 : tensor<4x1x32xi8, #blocked> -> 
tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5567583Z %106 = arith.select %38, %105, %104 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5567807Z %107 = tt.reshape %106 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:37:21.5568025Z %108 = arith.sitofp %107 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:37:21.5568275Z %109 = ttg.local_alloc %108 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:37:21.5568595Z %110 = ttg.local_load %109 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5569101Z %111 = tt.dot %91, %110, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:37:21.5569454Z %112 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:37:21.5569665Z %113 = tt.splat %112 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5569969Z %114 = arith.addi %113, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5570181Z %115 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:37:21.5570351Z %116 = tt.splat %115 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5570577Z %117 = arith.addi %116, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5570856Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:37:21.5571135Z %119 = tt.broadcast %118 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5571328Z %120 = arith.addi %27, %119 : tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5571531Z %121 = 
tt.addptr %28, %120 : tensor<16x8x!tt.ptr, #blocked1>, tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5571737Z %122 = tt.load %121 : tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:37:21.5571961Z %123 = ttg.local_alloc %122 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:21.5572288Z %124 = ttg.local_load %123 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5572690Z %125 = arith.extf %124 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5573150Z %126 = tt.expand_dims %114 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5573620Z %127 = arith.muli %126, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5573931Z %128 = tt.broadcast %127 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5574233Z %129 = arith.addi %128, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5574538Z %130 = tt.addptr %31, %129 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5574846Z %131 = tt.load %130 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5575076Z %132 = arith.shli %131, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5575310Z %133 = arith.shrsi %132, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5575546Z %134 = arith.shrsi %131, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5575838Z %135 = 
tt.expand_dims %133 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5576164Z %136 = tt.expand_dims %134 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5576444Z %137 = tt.broadcast %135 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5576678Z %138 = arith.select %36, %137, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5576912Z %139 = tt.broadcast %136 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5577183Z %140 = arith.select %38, %139, %138 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5577409Z %141 = tt.reshape %140 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:37:21.5577631Z %142 = arith.sitofp %141 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:37:21.5577876Z %143 = ttg.local_alloc %142 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:37:21.5578201Z %144 = ttg.local_load %143 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5578667Z %145 = tt.dot %125, %144, %111, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:37:21.5579011Z %146 = arith.addi %arg3, %c8_i32 : i32 2026-02-21T09:37:21.5579225Z %147 = tt.splat %146 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5579520Z %148 = arith.addi %147, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5579734Z %149 = arith.muli %146, %c2_i32 : i32 
2026-02-21T09:37:21.5579905Z %150 = tt.splat %149 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5580126Z %151 = arith.addi %150, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5580401Z %152 = tt.expand_dims %151 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:37:21.5580674Z %153 = tt.broadcast %152 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5580867Z %154 = arith.addi %27, %153 : tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5581071Z %155 = tt.addptr %28, %154 : tensor<16x8x!tt.ptr, #blocked1>, tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5581277Z %156 = tt.load %155 : tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:37:21.5581533Z %157 = ttg.local_alloc %156 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:21.5581857Z %158 = ttg.local_load %157 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5582257Z %159 = arith.extf %158 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5582711Z %160 = tt.expand_dims %148 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5583062Z %161 = arith.muli %160, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5583369Z %162 = tt.broadcast %161 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5583673Z %163 = arith.addi %162, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5583979Z %164 = tt.addptr %31, %163 : 
tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5584287Z %165 = tt.load %164 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5584513Z %166 = arith.shli %165, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5584746Z %167 = arith.shrsi %166, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5584979Z %168 = arith.shrsi %165, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5585302Z %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5585638Z %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5585918Z %171 = tt.broadcast %169 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5586156Z %172 = arith.select %36, %171, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5586391Z %173 = tt.broadcast %170 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5586629Z %174 = arith.select %38, %173, %172 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5586853Z %175 = tt.reshape %174 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:37:21.5587072Z %176 = arith.sitofp %175 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:37:21.5587320Z %177 = ttg.local_alloc %176 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:37:21.5587644Z %178 = ttg.local_load %177 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5588107Z %179 = tt.dot %159, %178, %145, 
inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:37:21.5588447Z scf.yield %179 : tensor<16x32xf32, #mma> 2026-02-21T09:37:21.5588569Z } {tt.flatten} 2026-02-21T09:37:21.5588753Z %40 = arith.addi %23, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.5589016Z %41 = arith.addi %24, %cst_1 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.5589294Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:37:21.5589597Z %43 = tt.broadcast %42 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5589787Z %44 = arith.addi %27, %43 : tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5589980Z %45 = tt.addptr %28, %44 : tensor<16x8x!tt.ptr, #blocked1>, tensor<16x8xi32, #blocked1> 2026-02-21T09:37:21.5590180Z %46 = tt.load %45 : tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:37:21.5590393Z %47 = ttg.local_alloc %46 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:21.5590708Z %48 = ttg.local_load %47 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5591104Z %49 = arith.extf %48 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5591554Z %50 = tt.expand_dims %40 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5591902Z %51 = arith.muli %50, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5592199Z %52 = 
tt.broadcast %51 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5592493Z %53 = arith.addi %52, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5592795Z %54 = tt.addptr %31, %53 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5593094Z %55 = tt.load %54 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5593345Z %56 = arith.shli %55, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5593571Z %57 = arith.shrsi %56, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5593798Z %58 = arith.shrsi %55, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.5594076Z %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5594402Z %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:37:21.5594674Z %61 = tt.broadcast %59 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5594906Z %62 = arith.select %36, %61, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5595131Z %63 = tt.broadcast %60 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5595356Z %64 = arith.select %38, %63, %62 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:37:21.5595576Z %65 = tt.reshape %64 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:37:21.5595785Z %66 = arith.sitofp %65 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:37:21.5596024Z %67 = ttg.local_alloc %66 : (tensor<8x32xf32, 
#blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:37:21.5596337Z %68 = ttg.local_load %67 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.5596790Z %69 = tt.dot %49, %68, %39, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:37:21.5597166Z %70 = arith.truncf %69 : tensor<16x32xf32, #mma> to tensor<16x32xbf16, #mma> 2026-02-21T09:37:21.5597420Z %71 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:21.5606127Z %72 = arith.muli %71, %cst_8 : tensor<16x1xi32, #mma> 2026-02-21T09:37:21.5606349Z %73 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T09:37:21.5606598Z %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x32xi32, #mma> 2026-02-21T09:37:21.5606792Z %75 = tt.broadcast %73 : tensor<1x32xi32, #mma> -> tensor<16x32xi32, #mma> 2026-02-21T09:37:21.5606960Z %76 = arith.addi %74, %75 : tensor<16x32xi32, #mma> 2026-02-21T09:37:21.5607129Z %77 = tt.splat %arg2 : !tt.ptr -> tensor<16x32x!tt.ptr, #mma> 2026-02-21T09:37:21.5607331Z %78 = tt.addptr %77, %76 : tensor<16x32x!tt.ptr, #mma>, tensor<16x32xi32, #mma> 2026-02-21T09:37:21.5607516Z tt.store %78, %70 : tensor<16x32x!tt.ptr, #mma> 2026-02-21T09:37:21.5607644Z tt.return 2026-02-21T09:37:21.5607729Z } 2026-02-21T09:37:21.5607807Z } 2026-02-21T09:37:21.5607851Z 2026-02-21T09:37:21.5607891Z {-# 2026-02-21T09:37:21.5607972Z external_resources: { 2026-02-21T09:37:21.5608071Z mlir_reproducer: { 2026-02-21T09:37:21.5609064Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, 
convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:21.5610065Z disable_threading: false,
2026-02-21T09:37:21.5610173Z verify_each: true
2026-02-21T09:37:21.5610310Z }
2026-02-21T09:37:21.5610385Z }
2026-02-21T09:37:21.5610454Z #-}
2026-02-21T09:37:21.5610733Z /tmp/torchinductor_root/td/ctd7r2a7zhsond3v5i5kxjbytnvurxpyiaxknhm5ys67dxidw5k7.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:21.5611409Z /tmp/torchinductor_root/td/ctd7r2a7zhsond3v5i5kxjbytnvurxpyiaxknhm5ys67dxidw5k7.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:21.5611957Z [77s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:21.5612679Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 32], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:21.5613325Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:21.5613493Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:21.9189174Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:21.9192562Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:37:21.9193420Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:37:21.9194156Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:37:21.9194864Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:21.9195753Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:21.9196198Z #smem = #ttg.shared_memory 2026-02-21T09:37:21.9196762Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:21.9197914Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:21.9198794Z %cst = arith.constant dense<0.000000e+00> : tensor<16x128xf32, #mma> 2026-02-21T09:37:21.9199129Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:21.9199375Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:21.9199623Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:21.9199855Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:21.9200147Z %cst_0 = arith.constant dense<0> : tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9200461Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:21.9200706Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:37:21.9200954Z 
%c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:21.9201176Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:21.9201414Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:37:21.9201643Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:21.9201881Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:21.9202279Z %cst_1 = arith.constant dense<29352960> : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9202930Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.9203382Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:21.9203916Z %cst_4 = arith.constant dense<4> : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9204359Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.9204712Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.9205057Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:21.9205354Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:21.9205745Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:21.9206302Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:21.9206936Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.9207530Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:21.9208025Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.9208381Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:21.9208725Z %7 = tt.splat %arg1 : !tt.ptr -> 
tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9209166Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:21.9209762Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:21.9210342Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.9210705Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.9210998Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T09:37:21.9211332Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:21.9211609Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T09:37:21.9211912Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x128x!tt.ptr, #mma> 2026-02-21T09:37:21.9212214Z %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.9212612Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:21.9213007Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9213285Z scf.for %arg3 = %0 to %c224_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:21.9213503Z %19 = arith.divsi %arg3, %c224_i32 : i32 2026-02-21T09:37:21.9213677Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:21.9213857Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:21.9214024Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:21.9214200Z %23 = arith.remsi %arg3, %c224_i32 : i32 2026-02-21T09:37:21.9214380Z %24 = arith.remsi %23, %22 : i32 
2026-02-21T09:37:21.9214543Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:21.9214705Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:21.9214867Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:21.9215116Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:21.9215420Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:21.9215729Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:21.9216080Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:21.9216319Z %32 = arith.muli %26, %c128_i32 : i32 2026-02-21T09:37:21.9216625Z %33 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.9217014Z %34 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:21.9217368Z %35 = arith.addi %33, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:21.9217700Z %36 = arith.addi %34, %4 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:21.9218003Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:21.9218283Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:21.9218507Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9218905Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9219352Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x128xf32, #mma>) : i32 { 2026-02-21T09:37:21.9219601Z %72 = 
arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:21.9219791Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.9220040Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.9220344Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:21.9220656Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9220879Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9221103Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9221375Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:21.9221672Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9222129Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9222454Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:21.9222656Z %83 = tt.splat %82 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9222918Z %84 = arith.addi %83, %40 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9223269Z %85 = tt.addptr %7, %84 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9223622Z %86 = tt.load %85 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9223885Z %87 = arith.shli %86, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9224151Z %88 = arith.shrsi %87, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:37:21.9224417Z %89 = arith.shrsi %86, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9224745Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9225160Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9225484Z %92 = tt.broadcast %90 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9225756Z %93 = arith.select %12, %92, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9226030Z %94 = tt.broadcast %91 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9226295Z %95 = arith.select %14, %94, %93 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9226550Z %96 = tt.reshape %95 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2> 2026-02-21T09:37:21.9226806Z %97 = arith.sitofp %96 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2> 2026-02-21T09:37:21.9227090Z %98 = ttg.local_alloc %97 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem> 2026-02-21T09:37:21.9227448Z %99 = ttg.local_load %98 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9227935Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:37:21.9228285Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:21.9228416Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:21.9228589Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:21.9228818Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.9229100Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:21.9229378Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9229579Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9229782Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9230045Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:21.9230315Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9230715Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9231001Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:21.9231182Z %113 = tt.splat %112 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9231422Z %114 = arith.addi %113, %40 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9231746Z %115 = tt.addptr %7, %114 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9232069Z %116 = tt.load %115 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9232311Z %117 = arith.shli %116, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9232552Z %118 = arith.shrsi %117, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9232799Z %119 = arith.shrsi %116, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:37:21.9233098Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9233473Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9233766Z %122 = tt.broadcast %120 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9234021Z %123 = arith.select %12, %122, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9234265Z %124 = tt.broadcast %121 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9234512Z %125 = arith.select %14, %124, %123 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9234749Z %126 = tt.reshape %125 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2> 2026-02-21T09:37:21.9234986Z %127 = arith.sitofp %126 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2> 2026-02-21T09:37:21.9235251Z %128 = ttg.local_alloc %127 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem> 2026-02-21T09:37:21.9235582Z %129 = ttg.local_load %128 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9236063Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:37:21.9236417Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:21.9236547Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:21.9236725Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:21.9236950Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:21.9237233Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:21.9237511Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9237717Z %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9237956Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9238164Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:21.9238433Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9238834Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9239120Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:21.9239306Z %143 = tt.splat %142 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9239540Z %144 = arith.addi %143, %40 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9239859Z %145 = tt.addptr %7, %144 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9240175Z %146 = tt.load %145 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9240413Z %147 = arith.shli %146, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9240656Z %148 = arith.shrsi %147, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9240893Z %149 = arith.shrsi %146, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9241189Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9241556Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9241846Z %152 = tt.broadcast %150 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9242097Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9242339Z %154 = tt.broadcast %151 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9242615Z %155 = arith.select %14, %154, %153 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9242854Z %156 = tt.reshape %155 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2> 2026-02-21T09:37:21.9243086Z %157 = arith.sitofp %156 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2> 2026-02-21T09:37:21.9243340Z %158 = ttg.local_alloc %157 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem> 2026-02-21T09:37:21.9243674Z %159 = ttg.local_load %158 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9244144Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:37:21.9251639Z scf.yield %160 : tensor<16x128xf32, #mma> 2026-02-21T09:37:21.9251772Z } {tt.flatten} 2026-02-21T09:37:21.9251900Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9252105Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:21.9252307Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:21.9252572Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, 
#blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9252982Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9253383Z %47 = arith.addi %40, %cst_1 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9253694Z %48 = tt.addptr %7, %47 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9254002Z %49 = tt.load %48 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9254232Z %50 = arith.shli %49, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9254460Z %51 = arith.shrsi %50, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9254689Z %52 = arith.shrsi %49, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:21.9254974Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9255310Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:21.9255591Z %55 = tt.broadcast %53 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9255824Z %56 = arith.select %12, %55, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9256058Z %57 = tt.broadcast %54 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9256288Z %58 = arith.select %14, %57, %56 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:21.9256513Z %59 = tt.reshape %58 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2> 
2026-02-21T09:37:21.9256785Z %60 = arith.sitofp %59 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2> 2026-02-21T09:37:21.9257031Z %61 = ttg.local_alloc %60 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem> 2026-02-21T09:37:21.9257357Z %62 = ttg.local_load %61 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:21.9257820Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:37:21.9258201Z %64 = arith.truncf %63 : tensor<16x128xf32, #mma> to tensor<16x128xbf16, #mma> 2026-02-21T09:37:21.9258462Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:21.9258695Z %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:21.9258931Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:37:21.9259194Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x128xi32, #mma> 2026-02-21T09:37:21.9259391Z %69 = tt.broadcast %67 : tensor<1x128xi32, #mma> -> tensor<16x128xi32, #mma> 2026-02-21T09:37:21.9259567Z %70 = arith.addi %68, %69 : tensor<16x128xi32, #mma> 2026-02-21T09:37:21.9259751Z %71 = tt.addptr %15, %70 : tensor<16x128x!tt.ptr, #mma>, tensor<16x128xi32, #mma> 2026-02-21T09:37:21.9259945Z tt.store %71, %64 : tensor<16x128x!tt.ptr, #mma> 2026-02-21T09:37:21.9260085Z } {tt.num_stages = 2 : i32} 2026-02-21T09:37:21.9260185Z tt.return 2026-02-21T09:37:21.9260269Z } 2026-02-21T09:37:21.9260341Z } 2026-02-21T09:37:21.9260385Z 2026-02-21T09:37:21.9260418Z {-# 2026-02-21T09:37:21.9260498Z external_resources: { 2026-02-21T09:37:21.9260600Z mlir_reproducer: { 2026-02-21T09:37:21.9261599Z 
pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:21.9262634Z disable_threading: false, 2026-02-21T09:37:21.9262739Z verify_each: true 2026-02-21T09:37:21.9262831Z } 2026-02-21T09:37:21.9262901Z } 2026-02-21T09:37:21.9262972Z #-} 2026-02-21T09:37:21.9263246Z /tmp/torchinductor_root/zz/czz3su6t24flt4g6nymr7ggtobvnfesbvw7h27sy3zis4is7fcfx.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:21.9263926Z /tmp/torchinductor_root/zz/czz3su6t24flt4g6nymr7ggtobvnfesbvw7h27sy3zis4is7fcfx.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:21.9264481Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:37:21.9265265Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:37:21.9265981Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:21.9266188Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:22.1683307Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:22.1686955Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}> 2026-02-21T09:37:22.1687470Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:37:22.1687887Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}> 2026-02-21T09:37:22.1688266Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:22.1688596Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:22.1688837Z #smem = #ttg.shared_memory 2026-02-21T09:37:22.1689136Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:22.1689755Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:22.1690278Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.1690491Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:22.1690645Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:22.1690798Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:22.1690942Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:22.1691130Z %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1691326Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:22.1691483Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:37:22.1691632Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:22.1691928Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:22.1692074Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:22.1692223Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:22.1692375Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:22.1692630Z %cst_1 = arith.constant dense<29352960> : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1692987Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1693277Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.1693558Z %cst_4 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1693841Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.1694064Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.1694285Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.1694477Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:22.1694733Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.1695088Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : 
tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.1695525Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.1695932Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.1696276Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1696656Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.1696969Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1697357Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:22.1697808Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:22.1698243Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.1698519Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.1698752Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.1698972Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.1699175Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.1699407Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.1699639Z %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1699942Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim 
= 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.1700238Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1700447Z scf.for %arg3 = %0 to %c112_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:22.1700611Z %19 = arith.divsi %arg3, %c112_i32 : i32 2026-02-21T09:37:22.1700745Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:22.1700874Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:22.1701002Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:22.1701138Z %23 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T09:37:22.1701270Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:22.1701445Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:22.1701569Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:22.1701690Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:22.1701874Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.1702106Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.1702335Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.1702564Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.1702818Z %32 = arith.muli %26, %c256_i32 : i32 2026-02-21T09:37:22.1703047Z %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.1703324Z %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.1703611Z %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.1703887Z %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.1704178Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = 
#blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.1704450Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.1704657Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1705048Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1705544Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:22.1705784Z %72 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:22.1705972Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1706215Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1706513Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.1706816Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1707024Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1707241Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1707447Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.1707714Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1708118Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1708402Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:22.1708579Z %83 = tt.splat %82 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent 
= #blocked}>> 2026-02-21T09:37:22.1708807Z %84 = arith.addi %83, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1709115Z %85 = tt.addptr %7, %84 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1709430Z %86 = tt.load %85 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1709655Z %87 = arith.shli %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1709927Z %88 = arith.shrsi %87, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1710158Z %89 = arith.shrsi %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1710438Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.1710774Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.1711052Z %92 = tt.broadcast %90 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1711291Z %93 = arith.select %12, %92, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1711525Z %94 = tt.broadcast %91 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1711755Z %95 = arith.select %14, %94, %93 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1711981Z %96 = tt.reshape %95 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.1712197Z %97 = arith.sitofp %96 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.1712447Z %98 = ttg.local_alloc %97 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.1712771Z %99 = ttg.local_load 
%98 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1713273Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.1713622Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:22.1713748Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:22.1713919Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1714142Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1714416Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.1714692Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1714883Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1715083Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1715293Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.1715562Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1715969Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1716252Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:22.1716432Z %113 = tt.splat %112 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1716666Z %114 = arith.addi %113, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:37:22.1716985Z %115 = tt.addptr %7, %114 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1717302Z %116 = tt.load %115 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1717539Z %117 = arith.shli %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1717810Z %118 = arith.shrsi %117, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1718047Z %119 = arith.shrsi %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1718411Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.1718752Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.1719044Z %122 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1719288Z %123 = arith.select %12, %122, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1719529Z %124 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1719767Z %125 = arith.select %14, %124, %123 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1719998Z %126 = tt.reshape %125 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.1720223Z %127 = arith.sitofp %126 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.1720473Z %128 = ttg.local_alloc %127 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.1720796Z %129 = ttg.local_load %128 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:37:22.1721301Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.1721645Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:22.1721769Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:22.1721939Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1722163Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.1722439Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.1722752Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1722948Z %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1723144Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1723351Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.1723611Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1724011Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1724288Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:22.1724461Z %143 = tt.splat %142 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1724691Z %144 = arith.addi %143, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1725000Z %145 = tt.addptr %7, %144 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, 
tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1725307Z %146 = tt.load %145 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1725582Z %147 = arith.shli %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1725814Z %148 = arith.shrsi %147, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1726049Z %149 = arith.shrsi %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1726340Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.1726673Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.1726961Z %152 = tt.broadcast %150 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1727203Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1727442Z %154 = tt.broadcast %151 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1727682Z %155 = arith.select %14, %154, %153 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1727910Z %156 = tt.reshape %155 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.1728135Z %157 = arith.sitofp %156 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.1728381Z %158 = ttg.local_alloc %157 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.1728702Z %159 = ttg.local_load %158 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1729205Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.1729552Z scf.yield %160 : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.1729676Z } {tt.flatten} 2026-02-21T09:37:22.1729794Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1729989Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.1730191Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.1730448Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1730842Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1731167Z %47 = arith.addi %40, %cst_1 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1731478Z %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1731782Z %49 = tt.load %48 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1732007Z %50 = arith.shli %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1732235Z %51 = arith.shrsi %50, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1732463Z %52 = arith.shrsi %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.1732746Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.1733079Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, 
#blocked> 2026-02-21T09:37:22.1733356Z %55 = tt.broadcast %53 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1733635Z %56 = arith.select %12, %55, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1733937Z %57 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1734164Z %58 = arith.select %14, %57, %56 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.1734389Z %59 = tt.reshape %58 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.1734605Z %60 = arith.sitofp %59 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.1734855Z %61 = ttg.local_alloc %60 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.1735173Z %62 = ttg.local_load %61 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.1735627Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.1736009Z %64 = arith.truncf %63 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:37:22.1736268Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:22.1736510Z %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.1736743Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:37:22.1736999Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.1737239Z %69 = tt.broadcast %67 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 
2026-02-21T09:37:22.1737415Z %70 = arith.addi %68, %69 : tensor<16x256xi32, #mma> 2026-02-21T09:37:22.1737600Z %71 = tt.addptr %15, %70 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:37:22.1737793Z tt.store %71, %64 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.1737929Z } {tt.num_stages = 1 : i32} 2026-02-21T09:37:22.1738035Z tt.return 2026-02-21T09:37:22.1738115Z } 2026-02-21T09:37:22.1738193Z } 2026-02-21T09:37:22.1738237Z 2026-02-21T09:37:22.1738269Z {-# 2026-02-21T09:37:22.1738350Z external_resources: { 2026-02-21T09:37:22.1738449Z mlir_reproducer: { 2026-02-21T09:37:22.1739452Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:22.1740437Z disable_threading: false, 2026-02-21T09:37:22.1740546Z verify_each: true 2026-02-21T09:37:22.1740635Z } 2026-02-21T09:37:22.1740709Z } 2026-02-21T09:37:22.1740780Z #-} 2026-02-21T09:37:22.1741055Z /tmp/torchinductor_root/5l/c5lkmmdndf4o3wrzqyrbnjortfj7bf4m24krxczkgzkiw6fnwv6l.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:22.1741730Z /tmp/torchinductor_root/5l/c5lkmmdndf4o3wrzqyrbnjortfj7bf4m24krxczkgzkiw6fnwv6l.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:22.1742281Z [78s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:22.1743102Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:37:22.1743813Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:22.1743980Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:22.3142701Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:22.3146565Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T09:37:22.3147449Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:37:22.3147927Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:37:22.3148296Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T09:37:22.3148646Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:22.3148968Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:22.3149195Z #smem = #ttg.shared_memory 2026-02-21T09:37:22.3149577Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", 
"ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:22.3150183Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:22.3150663Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.3150888Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.3151107Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.3151337Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.3151573Z %cst_3 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3151796Z %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:22.3151987Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:22.3152138Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:37:22.3152292Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:22.3152437Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:22.3152580Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:22.3152732Z %c4092_i32 = arith.constant 4092 : i32 2026-02-21T09:37:22.3152878Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:37:22.3153058Z %cst_5 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3153245Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:37:22.3153391Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:22.3153530Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:22.3153764Z %cst_6 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3154003Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:22.3154250Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:22.3154595Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, 
#ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.3154989Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.3155331Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.3155668Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3155997Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3156295Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:22.3156544Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:37:22.3156892Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:22.3157387Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:22.3157810Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.3158076Z %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.3158279Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:37:22.3158486Z %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.3158683Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:37:22.3158897Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.3164425Z scf.for %arg3 = %0 to %c112_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:22.3164595Z %17 = arith.divsi %arg3, %c112_i32 : 
i32 2026-02-21T09:37:22.3164732Z %18 = arith.muli %17, %c4_i32 : i32 2026-02-21T09:37:22.3164856Z %19 = arith.subi %c4_i32, %18 : i32 2026-02-21T09:37:22.3164982Z %20 = arith.minsi %19, %c4_i32 : i32 2026-02-21T09:37:22.3165113Z %21 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T09:37:22.3165237Z %22 = arith.remsi %21, %20 : i32 2026-02-21T09:37:22.3165358Z %23 = arith.addi %18, %22 : i32 2026-02-21T09:37:22.3165475Z %24 = arith.divsi %21, %20 : i32 2026-02-21T09:37:22.3165595Z %25 = arith.muli %23, %c16_i32 : i32 2026-02-21T09:37:22.3165773Z %26 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:22.3166001Z %27 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.3166230Z %28 = arith.addi %26, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:22.3166449Z %29 = arith.addi %27, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.3166625Z %30 = arith.muli %24, %c256_i32 : i32 2026-02-21T09:37:22.3166799Z %31 = tt.splat %30 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.3167023Z %32 = tt.splat %30 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.3167242Z %33 = arith.addi %31, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.3167478Z %34 = arith.addi %32, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.3167744Z %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:37:22.3167997Z %36 = arith.muli %35, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:22.3168188Z %37 = tt.broadcast %36 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3168474Z %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> 
tensor<1x256xi32, #blocked1> 2026-02-21T09:37:22.3168801Z %39 = tt.broadcast %38 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3169072Z %40 = scf.for %arg4 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:22.3169346Z %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3169568Z %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3169744Z %52 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:22.3169914Z %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3170130Z %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3170403Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:22.3170680Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3170878Z %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3171079Z %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3171280Z %59 = tt.load %58 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:22.3171546Z %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3171980Z %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3172360Z %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3172608Z %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1> 
2026-02-21T09:37:22.3172803Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3172998Z %65 = arith.addi %64, %39 : tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3173193Z %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3173392Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:37:22.3173634Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3173913Z %69 = arith.shli %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3174151Z %70 = arith.shrsi %69, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3174384Z %71 = arith.shrsi %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3174674Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3175012Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3175294Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3175536Z %75 = arith.select %13, %74, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3175772Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3176005Z %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3176237Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:37:22.3176501Z %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:37:22.3176755Z %80 = ttg.local_alloc %79 : 
(tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T09:37:22.3177076Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3177545Z %82 = tt.dot %61, %81, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.3177889Z %83 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:22.3178058Z %84 = tt.splat %83 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3178276Z %85 = arith.addi %84, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3178449Z %86 = arith.muli %83, %c2_i32 : i32 2026-02-21T09:37:22.3178611Z %87 = tt.splat %86 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3178821Z %88 = arith.addi %87, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3179089Z %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:22.3179358Z %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3179547Z %91 = arith.addi %37, %90 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3179744Z %92 = tt.addptr %7, %91 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3179975Z %93 = tt.load %92 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:22.3180233Z %94 = ttg.convert_layout %93 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3180625Z %95 = arith.extf %94 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:37:22.3180998Z %96 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3181238Z %97 = arith.muli %96, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3181427Z %98 = tt.broadcast %97 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3181614Z %99 = arith.addi %98, %39 : tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3181814Z %100 = tt.addptr %8, %99 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3182020Z %101 = tt.load %100 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:37:22.3182270Z %102 = ttg.convert_layout %101 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3182553Z %103 = arith.shli %102, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3182790Z %104 = arith.shrsi %103, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3183027Z %105 = arith.shrsi %102, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3183315Z %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3183658Z %107 = tt.expand_dims %105 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3183951Z %108 = tt.broadcast %106 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3184228Z %109 = arith.select %13, %108, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3184467Z %110 = tt.broadcast %107 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3184702Z %111 = arith.select %15, %110, %109 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 
2026-02-21T09:37:22.3184937Z %112 = tt.reshape %111 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:37:22.3185169Z %113 = arith.sitofp %112 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:37:22.3185426Z %114 = ttg.local_alloc %113 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T09:37:22.3185755Z %115 = ttg.local_load %114 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3186223Z %116 = tt.dot %95, %115, %82, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.3186568Z %117 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:37:22.3186744Z %118 = tt.splat %117 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3186966Z %119 = arith.addi %118, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3187144Z %120 = arith.muli %117, %c2_i32 : i32 2026-02-21T09:37:22.3187313Z %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3187532Z %122 = arith.addi %121, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3187848Z %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:22.3188124Z %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3188319Z %125 = arith.addi %37, %124 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3188520Z %126 = tt.addptr %7, %125 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3188726Z %127 = tt.load %126 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:22.3188992Z %128 = 
ttg.convert_layout %127 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3189384Z %129 = arith.extf %128 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3189769Z %130 = tt.expand_dims %119 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3190020Z %131 = arith.muli %130, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3190213Z %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3190408Z %133 = arith.addi %132, %39 : tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3190605Z %134 = tt.addptr %8, %133 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3190811Z %135 = tt.load %134 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:37:22.3191056Z %136 = ttg.convert_layout %135 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3191339Z %137 = arith.shli %136, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3191579Z %138 = arith.shrsi %137, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3191812Z %139 = arith.shrsi %136, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3192138Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3192474Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3192759Z %142 = tt.broadcast %140 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3193004Z %143 = arith.select 
%13, %142, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3193242Z %144 = tt.broadcast %141 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3193477Z %145 = arith.select %15, %144, %143 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3193711Z %146 = tt.reshape %145 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:37:22.3193939Z %147 = arith.sitofp %146 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:37:22.3194192Z %148 = ttg.local_alloc %147 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T09:37:22.3194515Z %149 = ttg.local_load %148 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3194985Z %150 = tt.dot %129, %149, %116, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.3195332Z scf.yield %150 : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.3195450Z } 2026-02-21T09:37:22.3195660Z %41 = scf.for %arg4 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %40) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:22.3195926Z %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3196154Z %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.3196329Z %52 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:22.3196493Z %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3196709Z %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.3196979Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, 
parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:22.3197253Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3197446Z %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3197642Z %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:22.3197844Z %59 = tt.load %58 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:22.3198107Z %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3198508Z %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3198890Z %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3199132Z %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:22.3199322Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3199512Z %65 = arith.addi %64, %39 : tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3199705Z %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:37:22.3199940Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:37:22.3200179Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3200455Z %69 = arith.shli %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3200683Z %70 = arith.shrsi %69, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3200915Z %71 = arith.shrsi %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.3201200Z %72 = 
tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3201533Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:37:22.3201823Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3202057Z %75 = arith.select %13, %74, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3202292Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3202521Z %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:37:22.3202769Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:37:22.3202990Z %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:37:22.3203402Z %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem> 2026-02-21T09:37:22.3203724Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.3204192Z %82 = tt.dot %61, %81, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.3204533Z scf.yield %82 : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.3204660Z } {tt.num_stages = 1 : i32} 2026-02-21T09:37:22.3204816Z %42 = arith.truncf %41 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:37:22.3205073Z %43 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:22.3205302Z %44 = arith.muli %43, %cst : 
tensor<16x1xi32, #mma> 2026-02-21T09:37:22.3205533Z %45 = tt.expand_dims %34 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:37:22.3205789Z %46 = tt.broadcast %44 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.3205988Z %47 = tt.broadcast %45 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.3206163Z %48 = arith.addi %46, %47 : tensor<16x256xi32, #mma> 2026-02-21T09:37:22.3206347Z %49 = tt.addptr %16, %48 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:37:22.3206534Z tt.store %49, %42 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.3206671Z } {tt.num_stages = 2 : i32} 2026-02-21T09:37:22.3206775Z tt.return 2026-02-21T09:37:22.3206856Z } 2026-02-21T09:37:22.3206927Z } 2026-02-21T09:37:22.3206971Z 2026-02-21T09:37:22.3207001Z {-# 2026-02-21T09:37:22.3207082Z external_resources: { 2026-02-21T09:37:22.3207179Z mlir_reproducer: { 2026-02-21T09:37:22.3208175Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:22.3209193Z disable_threading: false, 2026-02-21T09:37:22.3209299Z verify_each: true 2026-02-21T09:37:22.3209390Z } 2026-02-21T09:37:22.3209461Z } 2026-02-21T09:37:22.3209532Z #-} 2026-02-21T09:37:22.3209808Z /tmp/torchinductor_root/4z/c4znhgsqfporwgfho3bxq3jlqbwbxoddlldndzgjqz64fqow6iws.py:14:0: error: Failures have been detected while processing an MLIR 
pass pipeline 2026-02-21T09:37:22.3210498Z /tmp/torchinductor_root/4z/c4znhgsqfporwgfho3bxq3jlqbwbxoddlldndzgjqz64fqow6iws.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:22.3211055Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:22.3211835Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:37:22.3212545Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:22.3212750Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:22.5877601Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:37:22.5881261Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T09:37:22.5882208Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:37:22.5883120Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T09:37:22.5883896Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:22.5884600Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:22.5885109Z #smem = #ttg.shared_memory 2026-02-21T09:37:22.5885741Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:22.5887191Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:22.5887980Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.5888291Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.5888591Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.5888900Z %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.5889164Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:22.5889423Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.5889803Z %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5890178Z %cst_5 = arith.constant dense<29352960> : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5890539Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:22.5890735Z 
%c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:22.5890936Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:22.5891130Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:22.5891318Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:37:22.5891504Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:22.5891697Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:37:22.5891893Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:22.5892138Z %cst_6 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5892387Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:37:22.5892577Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:22.5892764Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:22.5892947Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:22.5893255Z %cst_7 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5893581Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:22.5893912Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.5894380Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.5894824Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.5895285Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.5895743Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5896222Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.5896559Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:22.5897022Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, 
parent = #blocked}>}>> 2026-02-21T09:37:22.5897605Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:22.5898109Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.5898426Z %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.5898673Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.5898923Z %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.5899160Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.5899423Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.5899687Z %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5900034Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.5900372Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5900618Z scf.for %arg3 = %0 to %c112_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:22.5900804Z %19 = arith.divsi %arg3, %c224_i32 : i32 2026-02-21T09:37:22.5900958Z %20 = arith.muli %19, %c8_i32 : i32 2026-02-21T09:37:22.5901107Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:22.5901281Z %22 = arith.minsi %21, %c8_i32 : i32 2026-02-21T09:37:22.5901432Z %23 = arith.remsi %arg3, %c224_i32 : i32 2026-02-21T09:37:22.5901627Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:22.5901768Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:22.5901910Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:22.5902052Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:22.5902263Z %28 
= tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.5902530Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.5902794Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.5903056Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.5903259Z %32 = arith.muli %26, %c256_i32 : i32 2026-02-21T09:37:22.5903462Z %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.5903729Z %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.5903998Z %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.5904262Z %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.5904606Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.5904923Z %38 = arith.muli %37, %cst_2 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.5905163Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5905513Z %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5905958Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_3) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:22.5906234Z %73 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:22.5906454Z %74 = tt.splat %73 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5906731Z %75 = arith.addi %74, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5907070Z %76 = tt.expand_dims %75 {axis = 0 : i32} : tensor<2xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.5907415Z %77 = tt.broadcast %76 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5907654Z %78 = arith.addi %39, %77 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5907906Z %79 = tt.addptr %6, %78 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5908146Z %80 = tt.load %79 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.5908417Z %81 = ttg.convert_layout %80 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5908824Z %82 = arith.extf %81 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5909107Z %83 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:22.5909254Z %84 = tt.splat %83 : i32 -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5909414Z %85 = arith.addi %84, %40 : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5909608Z %86 = tt.addptr %7, %85 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5909809Z %87 = tt.load %86 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:22.5910053Z %88 = ttg.convert_layout %87 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5910340Z %89 = arith.shli %88, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5910576Z %90 = arith.shrsi %89, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5910845Z %91 = arith.shrsi %88, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5911139Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.5911479Z %93 = tt.expand_dims %91 
{axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.5911769Z %94 = tt.broadcast %92 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5912014Z %95 = arith.select %12, %94, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5912254Z %96 = tt.broadcast %93 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5912486Z %97 = arith.select %14, %96, %95 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5912720Z %98 = tt.reshape %97 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.5912944Z %99 = arith.sitofp %98 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.5913201Z %100 = ttg.local_alloc %99 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.5913533Z %101 = ttg.local_load %100 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5914019Z %102 = tt.dot %82, %101, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.5914416Z %103 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:22.5914543Z %104 = arith.muli %103, %c2_i32 : i32 2026-02-21T09:37:22.5914722Z %105 = tt.splat %104 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5914955Z %106 = arith.addi %105, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5915237Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.5915514Z %108 = tt.broadcast %107 : tensor<1x2xi32, #blocked1> -> 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5915714Z %109 = arith.addi %39, %108 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5915919Z %110 = tt.addptr %6, %109 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5916128Z %111 = tt.load %110 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.5916401Z %112 = ttg.convert_layout %111 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5916807Z %113 = arith.extf %112 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5917096Z %114 = arith.muli %103, %c7168_i32 : i32 2026-02-21T09:37:22.5917247Z %115 = tt.splat %114 : i32 -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5917405Z %116 = arith.addi %115, %40 : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5917601Z %117 = tt.addptr %7, %116 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5917804Z %118 = tt.load %117 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:22.5918047Z %119 = ttg.convert_layout %118 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5918330Z %120 = arith.shli %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5918596Z %121 = arith.shrsi %120, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5918836Z %122 = arith.shrsi %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5919125Z %123 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.5919465Z %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 
2026-02-21T09:37:22.5919754Z %125 = tt.broadcast %123 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5919996Z %126 = arith.select %12, %125, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5920241Z %127 = tt.broadcast %124 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5920477Z %128 = arith.select %14, %127, %126 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5920715Z %129 = tt.reshape %128 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.5920943Z %130 = arith.sitofp %129 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.5921197Z %131 = ttg.local_alloc %130 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.5921524Z %132 = ttg.local_load %131 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5922023Z %133 = tt.dot %113, %132, %102, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.5922365Z %134 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:22.5922489Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:37:22.5922701Z %136 = tt.splat %135 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5922925Z %137 = arith.addi %136, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.5923202Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.5923475Z %139 = tt.broadcast %138 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5923671Z %140 = arith.addi %39, %139 : tensor<16x2xi32, 
#blocked1> 2026-02-21T09:37:22.5923868Z %141 = tt.addptr %6, %140 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5924079Z %142 = tt.load %141 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.5924342Z %143 = ttg.convert_layout %142 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5924745Z %144 = arith.extf %143 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5925030Z %145 = arith.muli %134, %c7168_i32 : i32 2026-02-21T09:37:22.5925173Z %146 = tt.splat %145 : i32 -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5925334Z %147 = arith.addi %146, %40 : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5925529Z %148 = tt.addptr %7, %147 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5925735Z %149 = tt.load %148 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:22.5925983Z %150 = ttg.convert_layout %149 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5926265Z %151 = arith.shli %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5926545Z %152 = arith.shrsi %151, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5926779Z %153 = arith.shrsi %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5927073Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.5927412Z %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.5927696Z %156 = tt.broadcast %154 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 
2026-02-21T09:37:22.5927940Z %157 = arith.select %12, %156, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5928183Z %158 = tt.broadcast %155 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5928420Z %159 = arith.select %14, %158, %157 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5928653Z %160 = tt.reshape %159 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.5928878Z %161 = arith.sitofp %160 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.5929136Z %162 = ttg.local_alloc %161 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.5929464Z %163 = ttg.local_load %162 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5929970Z %164 = tt.dot %144, %163, %133, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.5930318Z scf.yield %164 : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.5930443Z } {tt.flatten} 2026-02-21T09:37:22.5930563Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5930761Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.5930961Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.5931220Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5931615Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5931914Z %47 = arith.addi %40, %cst_5 : tensor<1x256xi32, 
#blocked2> 2026-02-21T09:37:22.5932114Z %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:22.5932310Z %49 = tt.load %48 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:22.5932552Z %50 = ttg.convert_layout %49 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5932829Z %51 = arith.shli %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5933061Z %52 = arith.shrsi %51, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5933291Z %53 = arith.shrsi %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.5933573Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.5933910Z %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.5934190Z %56 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5934457Z %57 = arith.select %12, %56, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5934690Z %58 = tt.broadcast %55 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5934916Z %59 = arith.select %14, %58, %57 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.5935141Z %60 = tt.reshape %59 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.5935358Z %61 = arith.sitofp %60 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.5935603Z %62 = ttg.local_alloc %61 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.5935923Z %63 = ttg.local_load %62 : !ttg.memdesc<2x256xf32, #shared, #smem> -> 
tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.5936378Z %64 = tt.dot %46, %63, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.5936762Z %65 = arith.truncf %64 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:37:22.5937021Z %66 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:22.5937251Z %67 = arith.muli %66, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.5937482Z %68 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:37:22.5937737Z %69 = tt.broadcast %67 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.5937966Z %70 = tt.broadcast %68 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.5938138Z %71 = arith.addi %69, %70 : tensor<16x256xi32, #mma> 2026-02-21T09:37:22.5938325Z %72 = tt.addptr %15, %71 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:37:22.5938516Z tt.store %72, %65 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.5938650Z } {tt.num_stages = 2 : i32} 2026-02-21T09:37:22.5938756Z tt.return 2026-02-21T09:37:22.5938835Z } 2026-02-21T09:37:22.5938909Z } 2026-02-21T09:37:22.5938951Z 2026-02-21T09:37:22.5938982Z {-# 2026-02-21T09:37:22.5939065Z external_resources: { 2026-02-21T09:37:22.5939161Z mlir_reproducer: { 2026-02-21T09:37:22.5940150Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, 
convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:22.5941136Z disable_threading: false, 2026-02-21T09:37:22.5941242Z verify_each: true 2026-02-21T09:37:22.5941331Z } 2026-02-21T09:37:22.5941405Z } 2026-02-21T09:37:22.5941472Z #-} 2026-02-21T09:37:22.5941749Z /tmp/torchinductor_root/6p/c6poxudocrgxudtay7fxqjmqmqdswjjmsl6bp7kfxegfvwk56ksq.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:22.5942427Z /tmp/torchinductor_root/6p/c6poxudocrgxudtay7fxqjmqmqdswjjmsl6bp7kfxegfvwk56ksq.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:22.5942973Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:22.5943786Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:37:22.5944489Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:22.5944654Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:37:22.6585434Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:22.6589189Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}> 2026-02-21T09:37:22.6590128Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:37:22.6590967Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}> 2026-02-21T09:37:22.6591739Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:22.6592440Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:22.6592941Z #smem = #ttg.shared_memory 2026-02-21T09:37:22.6593571Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:22.6595038Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:22.6596100Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.6596550Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:22.6596880Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:22.6597198Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:22.6597503Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:22.6597891Z %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6598235Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:22.6598437Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:37:22.6598583Z 
%c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:22.6598697Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:22.6598804Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:22.6598920Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:22.6599037Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:22.6599226Z %cst_1 = arith.constant dense<29352960> : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6599493Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.6599713Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.6599925Z %cst_4 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6600137Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.6600306Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.6600473Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.6600625Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:22.6600824Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.6601101Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.6601458Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.6601774Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.6602047Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.6602290Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.6602533Z %7 = tt.splat %arg1 : !tt.ptr -> 
tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6602956Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:22.6603454Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:22.6603913Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.6610674Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.6610903Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.6611106Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.6611294Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.6611503Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.6611796Z %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.6612073Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.6612349Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6612542Z scf.for %arg3 = %0 to %c112_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:22.6612690Z %19 = arith.divsi %arg3, %c112_i32 : i32 2026-02-21T09:37:22.6612815Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:22.6612932Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:22.6613049Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:22.6613168Z %23 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T09:37:22.6613290Z %24 = arith.remsi %23, %22 : i32 
2026-02-21T09:37:22.6613402Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:22.6613513Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:22.6613630Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:22.6613797Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.6614016Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.6614249Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.6614457Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.6614620Z %32 = arith.muli %26, %c256_i32 : i32 2026-02-21T09:37:22.6614828Z %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.6615075Z %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.6615323Z %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.6615570Z %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.6615877Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.6616124Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.6618175Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6618582Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6618984Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:22.6619206Z %72 = 
arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:22.6619385Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.6619609Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.6619888Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.6620167Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6620363Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6620560Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6620764Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.6621028Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6621488Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6621775Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:22.6621953Z %83 = tt.splat %82 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6622182Z %84 = arith.addi %83, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6622491Z %85 = tt.addptr %7, %84 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6622799Z %86 = tt.load %85 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6623028Z %87 = arith.shli %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6623259Z %88 = arith.shrsi %87, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:37:22.6623491Z %89 = arith.shrsi %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6623780Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6624117Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6624400Z %92 = tt.broadcast %90 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6624637Z %93 = arith.select %12, %92, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6624874Z %94 = tt.broadcast %91 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6625103Z %95 = arith.select %14, %94, %93 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6625334Z %96 = tt.reshape %95 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.6625555Z %97 = arith.sitofp %96 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.6625842Z %98 = ttg.local_alloc %97 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.6626161Z %99 = ttg.local_load %98 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6626630Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.6626974Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:22.6627099Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:22.6627271Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:22.6627495Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.6627775Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.6628048Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6628243Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6628442Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6628648Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.6628914Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6629340Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6629627Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:22.6629805Z %113 = tt.splat %112 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6630039Z %114 = arith.addi %113, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6630349Z %115 = tt.addptr %7, %114 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6630657Z %116 = tt.load %115 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6630892Z %117 = arith.shli %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6631129Z %118 = arith.shrsi %117, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6631367Z %119 = arith.shrsi %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:37:22.6631661Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6631998Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6632284Z %122 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6632526Z %123 = arith.select %12, %122, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6632768Z %124 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6633006Z %125 = arith.select %14, %124, %123 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6633240Z %126 = tt.reshape %125 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.6633502Z %127 = arith.sitofp %126 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.6633753Z %128 = ttg.local_alloc %127 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.6634078Z %129 = ttg.local_load %128 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6634549Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.6634888Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:22.6635012Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:22.6635183Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.6635406Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:22.6635685Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.6635957Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6636150Z %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6636348Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6636552Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.6636817Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6637241Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6637522Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:22.6637701Z %143 = tt.splat %142 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6637930Z %144 = arith.addi %143, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6638244Z %145 = tt.addptr %7, %144 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6638556Z %146 = tt.load %145 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6638792Z %147 = arith.shli %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6639032Z %148 = arith.shrsi %147, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6639266Z %149 = arith.shrsi %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6639558Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x256xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6639893Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6640178Z %152 = tt.broadcast %150 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6640421Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6640660Z %154 = tt.broadcast %151 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6640897Z %155 = arith.select %14, %154, %153 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6641131Z %156 = tt.reshape %155 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.6641396Z %157 = arith.sitofp %156 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.6641649Z %158 = ttg.local_alloc %157 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.6641973Z %159 = ttg.local_load %158 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6642443Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.6642866Z scf.yield %160 : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.6642987Z } {tt.flatten} 2026-02-21T09:37:22.6643108Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6643300Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.6643505Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.6643759Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, 
#blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6644149Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6644477Z %47 = arith.addi %40, %cst_1 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6644784Z %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6645126Z %49 = tt.load %48 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6645351Z %50 = arith.shli %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6645580Z %51 = arith.shrsi %50, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6645810Z %52 = arith.shrsi %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.6646090Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6646419Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.6646696Z %55 = tt.broadcast %53 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6646928Z %56 = arith.select %12, %55, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6647162Z %57 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6647389Z %58 = arith.select %14, %57, %56 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.6647615Z %59 = tt.reshape %58 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 
2026-02-21T09:37:22.6647834Z %60 = arith.sitofp %59 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.6648077Z %61 = ttg.local_alloc %60 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.6648393Z %62 = ttg.local_load %61 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.6648849Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.6649227Z %64 = arith.truncf %63 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:37:22.6649518Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:22.6649748Z %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.6649980Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:37:22.6650238Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.6650440Z %69 = tt.broadcast %67 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.6650615Z %70 = arith.addi %68, %69 : tensor<16x256xi32, #mma> 2026-02-21T09:37:22.6650795Z %71 = tt.addptr %15, %70 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:37:22.6650989Z tt.store %71, %64 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.6651127Z } {tt.num_stages = 2 : i32} 2026-02-21T09:37:22.6651234Z tt.return 2026-02-21T09:37:22.6651315Z } 2026-02-21T09:37:22.6651389Z } 2026-02-21T09:37:22.6651431Z 2026-02-21T09:37:22.6651460Z {-# 2026-02-21T09:37:22.6651539Z external_resources: { 2026-02-21T09:37:22.6651635Z mlir_reproducer: { 2026-02-21T09:37:22.6652653Z 
pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:22.6653630Z disable_threading: false, 2026-02-21T09:37:22.6653734Z verify_each: true 2026-02-21T09:37:22.6653826Z } 2026-02-21T09:37:22.6653895Z } 2026-02-21T09:37:22.6653964Z #-} 2026-02-21T09:37:22.6654243Z /tmp/torchinductor_root/xl/cxl247ga7zjukj7rwk2nqnku4ple2jgfgavadqlguhja5hgrbtdf.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:22.6654924Z /tmp/torchinductor_root/xl/cxl247ga7zjukj7rwk2nqnku4ple2jgfgavadqlguhja5hgrbtdf.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:22.6655467Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:37:22.6656238Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:37:22.6656939Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:22.6657103Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:22.8531631Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:22.8535166Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}> 2026-02-21T09:37:22.8536113Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:37:22.8536953Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}> 2026-02-21T09:37:22.8537688Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:22.8538030Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:22.8538266Z #smem = #ttg.shared_memory 2026-02-21T09:37:22.8538568Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:22.8539194Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:22.8539695Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.8539907Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:22.8540064Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:22.8540224Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:22.8540365Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:22.8540552Z %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8540751Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:37:22.8540902Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T09:37:22.8541054Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:37:22.8541203Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:22.8541350Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:22.8541490Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:22.8541639Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:22.8541787Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:22.8542096Z %cst_1 = arith.constant dense<29352960> : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8542447Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8542785Z %cst_3 = arith.constant dense<0> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8543115Z %cst_4 = arith.constant dense<7168> : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8543442Z %cst_5 = arith.constant dense<0> : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8543743Z %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.8544016Z %cst_7 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8544287Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.8544505Z 
%cst_9 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.8544728Z %cst_10 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.8544917Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:22.8545169Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.8545523Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.8545923Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.8546323Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.8546673Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8546980Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.8547275Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8547699Z %8 = arith.extsi %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.8548188Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:22.8548622Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:22.8549043Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.8549306Z %12 = 
arith.cmpi eq, %11, %cst_8 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.8549514Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.8549722Z %14 = arith.cmpi eq, %11, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.8549924Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:22.8550144Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.8550366Z %17 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8550658Z %18 = tt.expand_dims %17 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.8550942Z %19 = tt.broadcast %18 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8551145Z scf.for %arg3 = %0 to %c112_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:22.8551301Z %20 = arith.divsi %arg3, %c112_i32 : i32 2026-02-21T09:37:22.8551429Z %21 = arith.muli %20, %c4_i32 : i32 2026-02-21T09:37:22.8551584Z %22 = arith.subi %c4_i32, %21 : i32 2026-02-21T09:37:22.8551705Z %23 = arith.minsi %22, %c4_i32 : i32 2026-02-21T09:37:22.8551837Z %24 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T09:37:22.8551961Z %25 = arith.remsi %24, %23 : i32 2026-02-21T09:37:22.8552099Z %26 = arith.addi %21, %25 : i32 2026-02-21T09:37:22.8552215Z %27 = arith.divsi %24, %23 : i32 2026-02-21T09:37:22.8552337Z %28 = arith.muli %26, %c16_i32 : i32 2026-02-21T09:37:22.8552513Z %29 = tt.splat %28 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.8552740Z %30 = tt.splat %28 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.8552962Z %31 = arith.addi %29, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.8553183Z %32 = arith.addi %30, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:37:22.8553357Z %33 = arith.muli %27, %c256_i32 : i32 2026-02-21T09:37:22.8553524Z %34 = tt.splat %33 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.8553741Z %35 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.8554022Z %36 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.8554285Z %37 = arith.muli %36, %cst_6 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.8554489Z %38 = tt.broadcast %37 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8554668Z %39 = arith.extsi %33 : i32 to i64 2026-02-21T09:37:22.8554887Z %40 = tt.splat %39 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.8555201Z %41 = arith.addi %40, %8 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:22.8555617Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8556041Z %43 = arith.cmpi sge, %42, %cst_5 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8556297Z %44 = arith.cmpi slt, %42, %cst_4 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8556542Z %45 = arith.andi %43, %44 : tensor<1x256xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8556820Z %46 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:22.8557049Z %77 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:22.8557227Z %78 = tt.splat %77 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8557444Z %79 = arith.addi %78, %5 : tensor<2xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8557714Z %80 = tt.expand_dims %79 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.8557988Z %81 = tt.broadcast %80 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8558177Z %82 = arith.addi %38, %81 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8558373Z %83 = tt.addptr %6, %82 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8558572Z %84 = tt.load %83 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.8558834Z %85 = ttg.convert_layout %84 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8559234Z %86 = arith.extf %85 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8559552Z %87 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:37:22.8559673Z %88 = arith.muli %87, %c7168_i64 : i64 2026-02-21T09:37:22.8559847Z %89 = tt.splat %88 : i64 -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8560072Z %90 = arith.addi %89, %42 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8560383Z %91 = tt.addptr %7, %90 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8560703Z %92 = tt.load %91, %45, %cst_3 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8560944Z %93 = arith.shli %92, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8561174Z %94 = arith.shrsi %93, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8561407Z %95 = arith.shrsi %92, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:37:22.8561691Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8562026Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8562306Z %98 = tt.broadcast %96 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8562542Z %99 = arith.select %13, %98, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8562852Z %100 = tt.broadcast %97 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8563087Z %101 = arith.select %15, %100, %99 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8563322Z %102 = tt.reshape %101 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.8563551Z %103 = arith.sitofp %102 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.8563841Z %104 = ttg.local_alloc %103 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.8564169Z %105 = ttg.local_load %104 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8564641Z %106 = tt.dot %86, %105, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.8564985Z %107 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:22.8565107Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:37:22.8565275Z %109 = tt.splat %108 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8565498Z %110 = arith.addi %109, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8565777Z 
%111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.8566049Z %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8566241Z %113 = arith.addi %38, %112 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8566438Z %114 = tt.addptr %6, %113 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8566642Z %115 = tt.load %114 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.8566905Z %116 = ttg.convert_layout %115 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8567345Z %117 = arith.extf %116 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8567621Z %118 = arith.extsi %107 : i32 to i64 2026-02-21T09:37:22.8567743Z %119 = arith.muli %118, %c7168_i64 : i64 2026-02-21T09:37:22.8567918Z %120 = tt.splat %119 : i64 -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8568145Z %121 = arith.addi %120, %42 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8568454Z %122 = tt.addptr %7, %121 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8568776Z %123 = tt.load %122, %45, %cst_3 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8569019Z %124 = arith.shli %123, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8569255Z %125 = arith.shrsi %124, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8569490Z %126 = arith.shrsi %123, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8569779Z %127 = tt.expand_dims 
%125 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8570112Z %128 = tt.expand_dims %126 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8570396Z %129 = tt.broadcast %127 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8570638Z %130 = arith.select %13, %129, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8570876Z %131 = tt.broadcast %128 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8571110Z %132 = arith.select %15, %131, %130 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8571340Z %133 = tt.reshape %132 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.8571598Z %134 = arith.sitofp %133 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.8571849Z %135 = ttg.local_alloc %134 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.8572169Z %136 = ttg.local_load %135 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8572633Z %137 = tt.dot %117, %136, %106, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.8572975Z %138 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:22.8573094Z %139 = arith.muli %138, %c2_i32 : i32 2026-02-21T09:37:22.8573264Z %140 = tt.splat %139 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8573488Z %141 = arith.addi %140, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.8573760Z %142 = tt.expand_dims %141 {axis = 0 : i32} 
: tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.8574036Z %143 = tt.broadcast %142 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8574226Z %144 = arith.addi %38, %143 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8574422Z %145 = tt.addptr %6, %144 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8574624Z %146 = tt.load %145 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.8574923Z %147 = ttg.convert_layout %146 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8575318Z %148 = arith.extf %147 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8575595Z %149 = arith.extsi %138 : i32 to i64 2026-02-21T09:37:22.8575715Z %150 = arith.muli %149, %c7168_i64 : i64 2026-02-21T09:37:22.8575893Z %151 = tt.splat %150 : i64 -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8576125Z %152 = arith.addi %151, %42 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8576438Z %153 = tt.addptr %7, %152 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8576704Z %154 = arith.cmpi slt, %149, %c4096_i64 : i64 2026-02-21T09:37:22.8576885Z %155 = tt.splat %154 : i1 -> tensor<1x256xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8577112Z %156 = arith.andi %155, %45 : tensor<1x256xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8577357Z %157 = tt.load %153, %156, %cst_3 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8577603Z %158 = arith.shli %157, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8577837Z 
%159 = arith.shrsi %158, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8578073Z %160 = arith.shrsi %157, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8578363Z %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8578696Z %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8578983Z %163 = tt.broadcast %161 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8579256Z %164 = arith.select %13, %163, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8579499Z %165 = tt.broadcast %162 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8579738Z %166 = arith.select %15, %165, %164 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8579972Z %167 = tt.reshape %166 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.8580198Z %168 = arith.sitofp %167 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.8580448Z %169 = ttg.local_alloc %168 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.8580774Z %170 = ttg.local_load %169 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8581236Z %171 = tt.dot %148, %170, %137, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.8581583Z scf.yield %171 : tensor<16x256xf32, #mma> 2026-02-21T09:37:22.8581719Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T09:37:22.8581871Z 
%47 = arith.addi %38, %19 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8582063Z %48 = tt.addptr %6, %47 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.8582260Z %49 = tt.load %48 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.8582514Z %50 = ttg.convert_layout %49 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8582934Z %51 = arith.extf %50 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8583264Z %52 = arith.addi %42, %cst_1 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8583571Z %53 = tt.addptr %7, %52 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8583884Z %54 = tt.load %53, %45, %cst_3 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8584118Z %55 = arith.shli %54, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8584345Z %56 = arith.shrsi %55, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8584571Z %57 = arith.shrsi %54, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.8584851Z %58 = tt.expand_dims %56 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8585179Z %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:22.8585454Z %60 = tt.broadcast %58 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8585687Z %61 = arith.select %13, %60, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8585918Z %62 
= tt.broadcast %59 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8586142Z %63 = arith.select %15, %62, %61 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:22.8586366Z %64 = tt.reshape %63 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:22.8586584Z %65 = arith.sitofp %64 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:22.8586830Z %66 = ttg.local_alloc %65 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:22.8587180Z %67 = ttg.local_load %66 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.8587637Z %68 = tt.dot %51, %67, %46, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:22.8588012Z %69 = arith.truncf %68 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:37:22.8588271Z %70 = tt.expand_dims %32 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:22.8588500Z %71 = arith.muli %70, %cst_10 : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.8588734Z %72 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:37:22.8588987Z %73 = tt.broadcast %71 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.8589188Z %74 = tt.broadcast %72 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:22.8589363Z %75 = arith.addi %73, %74 : tensor<16x256xi32, #mma> 2026-02-21T09:37:22.8589543Z %76 = tt.addptr %16, %75 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:37:22.8589733Z tt.store %76, %69 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:22.8589867Z } 
{tt.num_stages = 2 : i32}
2026-02-21T09:37:22.8589970Z tt.return
2026-02-21T09:37:22.8590048Z }
2026-02-21T09:37:22.8590120Z }
2026-02-21T09:37:22.8590162Z
2026-02-21T09:37:22.8590192Z {-#
2026-02-21T09:37:22.8590271Z external_resources: {
2026-02-21T09:37:22.8590370Z mlir_reproducer: {
2026-02-21T09:37:22.8591389Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:22.8592379Z disable_threading: false,
2026-02-21T09:37:22.8592484Z verify_each: true
2026-02-21T09:37:22.8592570Z }
2026-02-21T09:37:22.8592643Z }
2026-02-21T09:37:22.8592712Z #-}
2026-02-21T09:37:22.8592986Z /tmp/torchinductor_root/vq/cvqpt4p6b7rcrefty43lyx4zk3ehkrwcm5jbv3au3db4fydatzse.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:22.8593661Z /tmp/torchinductor_root/vq/cvqpt4p6b7rcrefty43lyx4zk3ehkrwcm5jbv3au3db4fydatzse.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:22.8594207Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:22.8594988Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:22.8595704Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:22.8595872Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:22.9857107Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:22.9859571Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:22.9860467Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:22.9861310Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:22.9862129Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:22.9862946Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:22.9863827Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:22.9865145Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 :
i32}) attributes {noinline = false} { 2026-02-21T09:37:22.9866284Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.9866795Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.9867269Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.9867776Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x128xf32, #mma> 2026-02-21T09:37:22.9868278Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.9868673Z %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9869149Z %cst_5 = arith.constant dense<29352960> : tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9869419Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:22.9869629Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:22.9869836Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:37:22.9870032Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:22.9870229Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:22.9870425Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:37:22.9870626Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:22.9870813Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:22.9871065Z %cst_6 = arith.constant dense<0> : tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9871311Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:22.9871499Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:22.9871688Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:22.9872008Z %cst_7 = arith.constant dense<4> : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9872326Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:22.9872667Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.9873140Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:37:22.9873608Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.9874077Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.9874535Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9874951Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.9875296Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x!tt.ptr, #blocked2> 2026-02-21T09:37:22.9875759Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:22.9876525Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:22.9877220Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.9877625Z %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.9877950Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T09:37:22.9878208Z %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:22.9878451Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T09:37:22.9878722Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x128x!tt.ptr, #mma> 2026-02-21T09:37:22.9878991Z %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9879351Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.9879700Z %18 = tt.broadcast %17 : 
tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9879949Z scf.for %arg3 = %0 to %c224_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:22.9880134Z %19 = arith.divsi %arg3, %c224_i32 : i32 2026-02-21T09:37:22.9880294Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:22.9880444Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:22.9880587Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:22.9880742Z %23 = arith.remsi %arg3, %c224_i32 : i32 2026-02-21T09:37:22.9880895Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:22.9881084Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:22.9881230Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:22.9881382Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:22.9881596Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.9881865Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.9882143Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:22.9882410Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:22.9882693Z %32 = arith.muli %26, %c128_i32 : i32 2026-02-21T09:37:22.9882911Z %33 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.9883181Z %34 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.9883455Z %35 = arith.addi %33, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:22.9883733Z %36 = arith.addi %34, %4 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:22.9884081Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.9884406Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:22.9884653Z %39 = tt.broadcast %38 
: tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9885013Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9885434Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x128xf32, #mma>) : i32 { 2026-02-21T09:37:22.9885713Z %72 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:22.9885941Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9886212Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9886606Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.9886909Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9887103Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9887308Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9887514Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.9887788Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9888202Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9888498Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:22.9888642Z %83 = tt.splat %82 : i32 -> tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9888800Z %84 = arith.addi %83, %40 : tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9888999Z %85 = tt.addptr %7, %84 : tensor<1x128x!tt.ptr, #blocked2>, tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9889201Z %86 = tt.load %85 : 
tensor<1x128x!tt.ptr, #blocked2> 2026-02-21T09:37:22.9889449Z %87 = ttg.convert_layout %86 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9889741Z %88 = arith.shli %87, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9890019Z %89 = arith.shrsi %88, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9890259Z %90 = arith.shrsi %87, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9890556Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9890899Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9891188Z %93 = tt.broadcast %91 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9891432Z %94 = arith.select %12, %93, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9891674Z %95 = tt.broadcast %92 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9891907Z %96 = arith.select %14, %95, %94 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9892143Z %97 = tt.reshape %96 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3> 2026-02-21T09:37:22.9892372Z %98 = arith.sitofp %97 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3> 2026-02-21T09:37:22.9892687Z %99 = ttg.convert_layout %98 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9893167Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 
2026-02-21T09:37:22.9893527Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:22.9893651Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:22.9893828Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9894061Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9894344Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.9894687Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9894886Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9895094Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9895311Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.9895585Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9895998Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9896292Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:22.9896441Z %113 = tt.splat %112 : i32 -> tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9896611Z %114 = arith.addi %113, %40 : tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9896813Z %115 = tt.addptr %7, %114 : tensor<1x128x!tt.ptr, #blocked2>, tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9897023Z %116 = tt.load %115 : tensor<1x128x!tt.ptr, #blocked2> 2026-02-21T09:37:22.9897273Z %117 = ttg.convert_layout %116 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9897564Z %118 = arith.shli %117, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:37:22.9897803Z %119 = arith.shrsi %118, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9898068Z %120 = arith.shrsi %117, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9898359Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9898697Z %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9898985Z %123 = tt.broadcast %121 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9899230Z %124 = arith.select %12, %123, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9899474Z %125 = tt.broadcast %122 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9899713Z %126 = arith.select %14, %125, %124 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9899944Z %127 = tt.reshape %126 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3> 2026-02-21T09:37:22.9900175Z %128 = arith.sitofp %127 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3> 2026-02-21T09:37:22.9900478Z %129 = ttg.convert_layout %128 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9900939Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:37:22.9901281Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:22.9901401Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:22.9901572Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:22.9901796Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:22.9902074Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:22.9902386Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9902577Z %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9902775Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9902977Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.9903244Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9903640Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9903919Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:22.9904063Z %143 = tt.splat %142 : i32 -> tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9904221Z %144 = arith.addi %143, %40 : tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9904418Z %145 = tt.addptr %7, %144 : tensor<1x128x!tt.ptr, #blocked2>, tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9904618Z %146 = tt.load %145 : tensor<1x128x!tt.ptr, #blocked2> 2026-02-21T09:37:22.9904862Z %147 = ttg.convert_layout %146 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9905145Z %148 = arith.shli %147, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9905381Z %149 = arith.shrsi %148, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9905616Z %150 = arith.shrsi %147, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:37:22.9905937Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9906276Z %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9906561Z %153 = tt.broadcast %151 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9906804Z %154 = arith.select %12, %153, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9907044Z %155 = tt.broadcast %152 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9907282Z %156 = arith.select %14, %155, %154 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9907517Z %157 = tt.reshape %156 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3> 2026-02-21T09:37:22.9907742Z %158 = arith.sitofp %157 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3> 2026-02-21T09:37:22.9908039Z %159 = ttg.convert_layout %158 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9908505Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:37:22.9908851Z scf.yield %160 : tensor<16x128xf32, #mma> 2026-02-21T09:37:22.9908973Z } {tt.flatten} 2026-02-21T09:37:22.9909088Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9909281Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:22.9909477Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:22.9909732Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9910117Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9910441Z %47 = arith.addi %40, %cst_5 : tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9910637Z %48 = tt.addptr %7, %47 : tensor<1x128x!tt.ptr, #blocked2>, tensor<1x128xi32, #blocked2> 2026-02-21T09:37:22.9910831Z %49 = tt.load %48 : tensor<1x128x!tt.ptr, #blocked2> 2026-02-21T09:37:22.9911068Z %50 = ttg.convert_layout %49 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9911341Z %51 = arith.shli %50, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9911570Z %52 = arith.shrsi %51, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9911800Z %53 = arith.shrsi %50, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:22.9912082Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9912415Z %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:37:22.9912691Z %56 = tt.broadcast %54 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9912922Z %57 = arith.select %12, %56, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9913154Z %58 = tt.broadcast %55 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9913379Z %59 = arith.select %14, %58, %57 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:37:22.9913600Z %60 = tt.reshape %59 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3> 2026-02-21T09:37:22.9913863Z %61 
= arith.sitofp %60 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3> 2026-02-21T09:37:22.9914152Z %62 = ttg.convert_layout %61 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:22.9914604Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:37:22.9914980Z %64 = arith.truncf %63 : tensor<16x128xf32, #mma> to tensor<16x128xbf16, #mma> 2026-02-21T09:37:22.9915243Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:22.9915481Z %66 = arith.muli %65, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:22.9915711Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:37:22.9915969Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x128xi32, #mma> 2026-02-21T09:37:22.9916169Z %69 = tt.broadcast %67 : tensor<1x128xi32, #mma> -> tensor<16x128xi32, #mma> 2026-02-21T09:37:22.9916347Z %70 = arith.addi %68, %69 : tensor<16x128xi32, #mma> 2026-02-21T09:37:22.9916530Z %71 = tt.addptr %15, %70 : tensor<16x128x!tt.ptr, #mma>, tensor<16x128xi32, #mma> 2026-02-21T09:37:22.9916719Z tt.store %71, %64 : tensor<16x128x!tt.ptr, #mma> 2026-02-21T09:37:22.9916881Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32} 2026-02-21T09:37:22.9917015Z tt.return 2026-02-21T09:37:22.9917098Z } 2026-02-21T09:37:22.9917170Z } 2026-02-21T09:37:22.9917216Z 2026-02-21T09:37:22.9917247Z {-# 2026-02-21T09:37:22.9917326Z external_resources: { 2026-02-21T09:37:22.9917429Z mlir_reproducer: { 2026-02-21T09:37:22.9918424Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, 
allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:22.9919443Z disable_threading: false, 2026-02-21T09:37:22.9919547Z verify_each: true 2026-02-21T09:37:22.9919641Z } 2026-02-21T09:37:22.9919713Z } 2026-02-21T09:37:22.9919785Z #-} 2026-02-21T09:37:22.9920065Z /tmp/torchinductor_root/am/cambokeaynklkiwh66dt4xp3c3u5b7f7n3f6v6blvel4ay7c5o5p.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:22.9920748Z /tmp/torchinductor_root/am/cambokeaynklkiwh66dt4xp3c3u5b7f7n3f6v6blvel4ay7c5o5p.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:22.9921307Z [79s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:22.9922111Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:37:22.9922867Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:22.9923090Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:37:23.1257569Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:23.1260979Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:37:23.1261937Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:23.1262791Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:37:23.1263599Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:37:23.1264359Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:23.1265065Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:23.1265565Z #smem = #ttg.shared_memory 2026-02-21T09:37:23.1266212Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:23.1267520Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:23.1268577Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:23.1268955Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.1269297Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.1269641Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:37:23.1270015Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 
2026-02-21T09:37:23.1270431Z %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1271104Z %cst_5 = arith.constant dense<29352960> : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1271409Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:23.1271639Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:23.1271875Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:23.1272095Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:23.1272317Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:23.1272535Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:37:23.1272750Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:23.1272966Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:23.1273251Z %cst_6 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1273531Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:23.1273753Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:23.1273968Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:23.1274319Z %cst_7 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1274689Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:23.1275065Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:23.1275599Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:23.1276120Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:23.1276639Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:23.1277152Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1277652Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, 
#blocked1> 2026-02-21T09:37:23.1277949Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:23.1278330Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:23.1278903Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:23.1279468Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.1279821Z %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.1280109Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:23.1280383Z %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.1280643Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:23.1280945Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:23.1281245Z %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1281631Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.1282014Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1282285Z scf.for %arg3 = %0 to %c112_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:23.1282495Z %19 = arith.divsi %arg3, %c112_i32 : i32 2026-02-21T09:37:23.1282738Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:23.1282907Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:23.1283076Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:23.1283242Z %23 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T09:37:23.1283464Z 
%24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:23.1283625Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:23.1283787Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:23.1283947Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:23.1284180Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:23.1284479Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:23.1284776Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:23.1285091Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:23.1285317Z %32 = arith.muli %26, %c256_i32 : i32 2026-02-21T09:37:23.1285551Z %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:23.1285850Z %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:23.1286151Z %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:23.1286445Z %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:23.1286803Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:23.1287078Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:23.1287292Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1287599Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1288116Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:23.1288363Z %73 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:23.1288554Z %74 = tt.splat %73 : i32 -> 
tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1288796Z %75 = arith.addi %74, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1289092Z %76 = tt.expand_dims %75 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.1289394Z %77 = tt.broadcast %76 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1289608Z %78 = arith.addi %39, %77 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1289824Z %79 = tt.addptr %6, %78 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1290049Z %80 = tt.load %79 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.1290339Z %81 = ttg.convert_layout %80 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1290786Z %82 = arith.extf %81 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1291101Z %83 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:23.1291261Z %84 = tt.splat %83 : i32 -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1291434Z %85 = arith.addi %84, %40 : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1291644Z %86 = tt.addptr %7, %85 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1291860Z %87 = tt.load %86 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:23.1292122Z %88 = ttg.convert_layout %87 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1292434Z %89 = arith.shli %88, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1292691Z %90 = arith.shrsi %89, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1292982Z %91 = arith.shrsi %88, %cst_7 : tensor<1x256xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1293304Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1293677Z %93 = tt.expand_dims %91 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1293986Z %94 = tt.broadcast %92 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1294255Z %95 = arith.select %12, %94, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1294516Z %96 = tt.broadcast %93 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1294772Z %97 = arith.select %14, %96, %95 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1295033Z %98 = tt.reshape %97 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:37:23.1295277Z %99 = arith.sitofp %98 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 2026-02-21T09:37:23.1295559Z %100 = ttg.local_alloc %99 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.1295922Z %101 = ttg.local_load %100 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1296453Z %102 = tt.dot %82, %101, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.1296886Z %103 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:23.1297024Z %104 = arith.muli %103, %c2_i32 : i32 2026-02-21T09:37:23.1297219Z %105 = tt.splat %104 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1297443Z %106 = arith.addi %105, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T09:37:23.1297725Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.1298007Z %108 = tt.broadcast %107 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1298202Z %109 = arith.addi %39, %108 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1298403Z %110 = tt.addptr %6, %109 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1298610Z %111 = tt.load %110 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.1298883Z %112 = ttg.convert_layout %111 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1299285Z %113 = arith.extf %112 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1299563Z %114 = arith.muli %103, %c7168_i32 : i32 2026-02-21T09:37:23.1299710Z %115 = tt.splat %114 : i32 -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1299870Z %116 = arith.addi %115, %40 : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1300069Z %117 = tt.addptr %7, %116 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1300274Z %118 = tt.load %117 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:23.1300517Z %119 = ttg.convert_layout %118 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1300811Z %120 = arith.shli %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1301094Z %121 = arith.shrsi %120, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1301336Z %122 = arith.shrsi %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1301629Z %123 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x256xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1301965Z %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1302255Z %125 = tt.broadcast %123 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1302498Z %126 = arith.select %12, %125, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1302742Z %127 = tt.broadcast %124 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1302977Z %128 = arith.select %14, %127, %126 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1303207Z %129 = tt.reshape %128 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:37:23.1303431Z %130 = arith.sitofp %129 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 2026-02-21T09:37:23.1303708Z %131 = ttg.local_alloc %130 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.1304123Z %132 = ttg.local_load %131 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1304647Z %133 = tt.dot %113, %132, %102, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.1305027Z %134 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:23.1305193Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:37:23.1305389Z %136 = tt.splat %135 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1305640Z %137 = arith.addi %136, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.1305939Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, 
parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.1312190Z %139 = tt.broadcast %138 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1312416Z %140 = arith.addi %39, %139 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1312621Z %141 = tt.addptr %6, %140 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1312832Z %142 = tt.load %141 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.1313106Z %143 = ttg.convert_layout %142 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1313512Z %144 = arith.extf %143 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1313799Z %145 = arith.muli %134, %c7168_i32 : i32 2026-02-21T09:37:23.1313946Z %146 = tt.splat %145 : i32 -> tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1314107Z %147 = arith.addi %146, %40 : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1314304Z %148 = tt.addptr %7, %147 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1314508Z %149 = tt.load %148 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:23.1314756Z %150 = ttg.convert_layout %149 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1315036Z %151 = arith.shli %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1315337Z %152 = arith.shrsi %151, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1315573Z %153 = arith.shrsi %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1315864Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1316205Z %155 = 
tt.expand_dims %153 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1316492Z %156 = tt.broadcast %154 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1316739Z %157 = arith.select %12, %156, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1316982Z %158 = tt.broadcast %155 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1317219Z %159 = arith.select %14, %158, %157 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1317450Z %160 = tt.reshape %159 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:37:23.1317675Z %161 = arith.sitofp %160 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 2026-02-21T09:37:23.1317931Z %162 = ttg.local_alloc %161 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.1318255Z %163 = ttg.local_load %162 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1318753Z %164 = tt.dot %144, %163, %133, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.1319104Z scf.yield %164 : tensor<16x256xf32, #mma> 2026-02-21T09:37:23.1319228Z } {tt.flatten} 2026-02-21T09:37:23.1319347Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1319542Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.1319739Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.1319998Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1320389Z 
%46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1320685Z %47 = arith.addi %40, %cst_5 : tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1320884Z %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr, #blocked2>, tensor<1x256xi32, #blocked2> 2026-02-21T09:37:23.1321078Z %49 = tt.load %48 : tensor<1x256x!tt.ptr, #blocked2> 2026-02-21T09:37:23.1321316Z %50 = ttg.convert_layout %49 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1321589Z %51 = arith.shli %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1321821Z %52 = arith.shrsi %51, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1322050Z %53 = arith.shrsi %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.1322333Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1322719Z %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.1323001Z %56 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1323396Z %57 = arith.select %12, %56, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1323627Z %58 = tt.broadcast %55 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1323853Z %59 = arith.select %14, %58, %57 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.1324080Z %60 = tt.reshape %59 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:37:23.1324296Z %61 = arith.sitofp %60 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 
2026-02-21T09:37:23.1324546Z %62 = ttg.local_alloc %61 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.1324866Z %63 = ttg.local_load %62 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.1325323Z %64 = tt.dot %46, %63, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.1325704Z %65 = arith.truncf %64 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:37:23.1325961Z %66 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:23.1326193Z %67 = arith.muli %66, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:23.1326424Z %68 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:37:23.1326680Z %69 = tt.broadcast %67 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:23.1326921Z %70 = tt.broadcast %68 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:23.1327093Z %71 = arith.addi %69, %70 : tensor<16x256xi32, #mma> 2026-02-21T09:37:23.1327277Z %72 = tt.addptr %15, %71 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:37:23.1327465Z tt.store %72, %65 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:23.1327601Z } {tt.num_stages = 2 : i32} 2026-02-21T09:37:23.1327702Z tt.return 2026-02-21T09:37:23.1327781Z } 2026-02-21T09:37:23.1327853Z } 2026-02-21T09:37:23.1327895Z 2026-02-21T09:37:23.1327925Z {-# 2026-02-21T09:37:23.1327967Z external_resources: { 2026-02-21T09:37:23.1328004Z mlir_reproducer: { 2026-02-21T09:37:23.1328945Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, 
convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:23.1328989Z disable_threading: false, 2026-02-21T09:37:23.1329025Z verify_each: true 2026-02-21T09:37:23.1329056Z } 2026-02-21T09:37:23.1329089Z } 2026-02-21T09:37:23.1329119Z #-} 2026-02-21T09:37:23.1329354Z /tmp/torchinductor_root/75/c75e3kbo34qoroitpjojd5lcsamuqtto3niepbqlxidsoepxczrm.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:23.1329767Z /tmp/torchinductor_root/75/c75e3kbo34qoroitpjojd5lcsamuqtto3niepbqlxidsoepxczrm.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:23.1329880Z [79s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:37:23.1330504Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:37:23.1330595Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:23.1330676Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:23.3520446Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:23.3524388Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}> 2026-02-21T09:37:23.3525336Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:37:23.3526175Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}> 2026-02-21T09:37:23.3526960Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:23.3527523Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:23.3527868Z #smem = #ttg.shared_memory 2026-02-21T09:37:23.3528241Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:23.3529108Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:23.3529756Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:37:23.3530008Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:23.3530197Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:23.3530378Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:23.3530550Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:23.3530770Z %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3531006Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:23.3531191Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:37:23.3531375Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:23.3531551Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:23.3531722Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:23.3531903Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:23.3532086Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:23.3532389Z %cst_1 = arith.constant dense<29352960> : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3532812Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3533152Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:23.3533482Z %cst_4 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3533812Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.3534079Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.3534340Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:23.3534568Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:23.3534874Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:23.3535300Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : 
tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:23.3535847Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:23.3536333Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:23.3536749Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3537121Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.3537411Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3537778Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:23.3538272Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:23.3538762Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.3539065Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.3539298Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:23.3539535Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:23.3539758Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:37:23.3540009Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:23.3542646Z %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3542989Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim 
= 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.3543318Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3543549Z scf.for %arg3 = %0 to %c112_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:23.3543725Z %19 = arith.divsi %arg3, %c112_i32 : i32 2026-02-21T09:37:23.3543871Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:23.3544009Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:23.3544146Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:23.3544290Z %23 = arith.remsi %arg3, %c112_i32 : i32 2026-02-21T09:37:23.3544431Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:23.3544564Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:23.3544696Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:23.3544832Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:23.3545034Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:23.3545294Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:23.3545544Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:23.3545795Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:23.3545987Z %32 = arith.muli %26, %c256_i32 : i32 2026-02-21T09:37:23.3546236Z %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:23.3546544Z %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:23.3546848Z %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:23.3547155Z %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:23.3547543Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = 
#blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:23.3547844Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:23.3548048Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3548401Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3548794Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:37:23.3549008Z %72 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:23.3549181Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3549396Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3549670Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.3549936Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3550130Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3550324Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3550522Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.3550783Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3551208Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3551493Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:23.3551665Z %83 = tt.splat %82 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent 
= #blocked}>> 2026-02-21T09:37:23.3551887Z %84 = arith.addi %83, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3552195Z %85 = tt.addptr %7, %84 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3552496Z %86 = tt.load %85 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3552723Z %87 = arith.shli %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3552953Z %88 = arith.shrsi %87, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3553184Z %89 = arith.shrsi %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3553471Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.3553799Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.3554082Z %92 = tt.broadcast %90 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3554316Z %93 = arith.select %12, %92, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3554547Z %94 = tt.broadcast %91 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3554772Z %95 = arith.select %14, %94, %93 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3554996Z %96 = tt.reshape %95 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:23.3555214Z %97 = arith.sitofp %96 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:23.3555512Z %98 = ttg.local_alloc %97 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.3555832Z %99 = ttg.local_load 
%98 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3556300Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.3556642Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:23.3556762Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:23.3556934Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3557153Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3557429Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.3557700Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3557892Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3558087Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3558289Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.3558554Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3558979Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3559261Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:23.3559437Z %113 = tt.splat %112 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3559665Z %114 = arith.addi %113, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:37:23.3559974Z %115 = tt.addptr %7, %114 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3560280Z %116 = tt.load %115 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3560510Z %117 = arith.shli %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3560747Z %118 = arith.shrsi %117, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3560985Z %119 = arith.shrsi %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3561276Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.3561609Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.3561898Z %122 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3562142Z %123 = arith.select %12, %122, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3562381Z %124 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3562684Z %125 = arith.select %14, %124, %123 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3562916Z %126 = tt.reshape %125 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:23.3563179Z %127 = arith.sitofp %126 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:23.3563431Z %128 = ttg.local_alloc %127 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.3563757Z %129 = ttg.local_load %128 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:37:23.3564229Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.3564572Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:23.3564693Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:23.3564865Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3565084Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:23.3565368Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:23.3565644Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3565838Z %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3566035Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3566237Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.3566501Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3566934Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3567211Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:23.3567388Z %143 = tt.splat %142 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3567615Z %144 = arith.addi %143, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3567923Z %145 = tt.addptr %7, %144 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, 
tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3568231Z %146 = tt.load %145 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3568461Z %147 = arith.shli %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3568701Z %148 = arith.shrsi %147, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3568938Z %149 = arith.shrsi %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3569232Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.3569570Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.3569858Z %152 = tt.broadcast %150 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3570102Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3570341Z %154 = tt.broadcast %151 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3570582Z %155 = arith.select %14, %154, %153 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3570816Z %156 = tt.reshape %155 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:23.3571072Z %157 = arith.sitofp %156 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:23.3571327Z %158 = ttg.local_alloc %157 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.3571648Z %159 = ttg.local_load %158 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3572118Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.3572462Z scf.yield %160 : tensor<16x256xf32, #mma> 2026-02-21T09:37:23.3572585Z } {tt.flatten} 2026-02-21T09:37:23.3572705Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3572894Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:23.3573097Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:23.3573355Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3573741Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3574066Z %47 = arith.addi %40, %cst_1 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3574374Z %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3574702Z %49 = tt.load %48 : tensor<1x256x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3574926Z %50 = arith.shli %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3575158Z %51 = arith.shrsi %50, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3575387Z %52 = arith.shrsi %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:23.3575672Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:37:23.3576003Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, 
#blocked> 2026-02-21T09:37:23.3576281Z %55 = tt.broadcast %53 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3576513Z %56 = arith.select %12, %55, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3576750Z %57 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3576982Z %58 = arith.select %14, %57, %56 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:37:23.3577205Z %59 = tt.reshape %58 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2> 2026-02-21T09:37:23.3577426Z %60 = arith.sitofp %59 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2> 2026-02-21T09:37:23.3577672Z %61 = ttg.local_alloc %60 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:37:23.3577992Z %62 = ttg.local_load %61 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:23.3578454Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:37:23.3578831Z %64 = arith.truncf %63 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:37:23.3579131Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:23.3579360Z %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:23.3579591Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:37:23.3579846Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:37:23.3580043Z %69 = tt.broadcast %67 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 
2026-02-21T09:37:23.3580218Z %70 = arith.addi %68, %69 : tensor<16x256xi32, #mma> 2026-02-21T09:37:23.3580400Z %71 = tt.addptr %15, %70 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:37:23.3580592Z tt.store %71, %64 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:37:23.3580731Z } {tt.num_stages = 2 : i32} 2026-02-21T09:37:23.3580837Z tt.return 2026-02-21T09:37:23.3580917Z } 2026-02-21T09:37:23.3580988Z } 2026-02-21T09:37:23.3581031Z 2026-02-21T09:37:23.3581065Z {-# 2026-02-21T09:37:23.3581143Z external_resources: { 2026-02-21T09:37:23.3581246Z mlir_reproducer: { 2026-02-21T09:37:23.3582273Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:23.3583258Z disable_threading: false, 2026-02-21T09:37:23.3583361Z verify_each: true 2026-02-21T09:37:23.3583454Z } 2026-02-21T09:37:23.3583527Z } 2026-02-21T09:37:23.3583596Z #-} 2026-02-21T09:37:23.3583869Z /tmp/torchinductor_root/fy/cfyx5plztc5dsvzazhgqd5rvw2taewki2ywtdqugmslksmnddieo.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:23.3584548Z /tmp/torchinductor_root/fy/cfyx5plztc5dsvzazhgqd5rvw2taewki2ywtdqugmslksmnddieo.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:23.3585097Z [79s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:23.3585873Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:23.3586583Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:23.3586754Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:24.8588522Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 104/104 16.4 configs/s
2026-02-21T09:37:27.1489159Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 353.2
2026-02-21T09:37:27.1489752Z configs/s
2026-02-21T09:37:27.7292300Z [83s] Generation 1 complete:
2026-02-21T09:37:27.7292668Z error=15
2026-02-21T09:37:27.7292873Z ok=91
2026-02-21T09:37:27.7293087Z min=0.1501
2026-02-21T09:37:27.7293324Z mid=0.5532
2026-02-21T09:37:27.7293530Z max=8.1947
2026-02-21T09:37:27.7293766Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:37:27.7294443Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:37:27.7294804Z 'l2_groupings': [1],
2026-02-21T09:37:27.7295079Z 'load_eviction_policies': ['', ''],
2026-02-21T09:37:27.7295406Z 'loop_orders': [[0, 1]],
2026-02-21T09:37:27.7295682Z 'matrix_instr_nonkdim': 32,
2026-02-21T09:37:27.7295967Z 'num_stages': 2,
2026-02-21T09:37:27.7296193Z 'num_warps': 4,
2026-02-21T09:37:27.7296437Z 'pid_type': 'flat',
2026-02-21T09:37:27.7296703Z 'range_flattens': [None, None],
2026-02-21T09:37:27.7297013Z 'range_multi_buffers': [None, True],
2026-02-21T09:37:27.7297334Z 'range_num_stages': [0, 1],
2026-02-21T09:37:27.7297579Z 'range_unroll_factors': [0, 1],
2026-02-21T09:37:27.7297845Z 'range_warp_specializes': [],
2026-02-21T09:37:27.7298094Z 'waves_per_eu': 3}
2026-02-21T09:37:27.7450794Z [83s] Fitting surrogate: 206 points, 206 targets
2026-02-21T09:37:29.3756379Z [85s] Generation 2 starting: 97 neighbors, 5 active search path(s)
2026-02-21T09:37:45.1001155Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 5.6 configs/s
2026-02-21T09:37:48.3940166Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:48.3943913Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:48.3946613Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:48.3947501Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:48.3948252Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:48.3949549Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:48.3950859Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:48.3951863Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:48.3952344Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.3952785Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.3953246Z %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.3953734Z
%cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:48.3954315Z %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3954842Z %cst_5 = arith.constant dense<29352960> : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3955160Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:48.3955395Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:48.3955631Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:48.3955848Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:48.3956063Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:48.3956290Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:48.3956511Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:48.3956802Z %cst_6 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3957089Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:37:48.3957319Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:48.3957536Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:48.3957888Z %cst_7 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3958269Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:48.3958496Z %1 = arith.divsi %0, %c224_i32 : i32 2026-02-21T09:37:48.3958889Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:37:48.3959109Z %3 = arith.subi %c4_i32, %2 : i32 2026-02-21T09:37:48.3959325Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:37:48.3959541Z %5 = arith.remsi %0, %c224_i32 : i32 2026-02-21T09:37:48.3959766Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:37:48.3959983Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:37:48.3960189Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:37:48.3960399Z %9 = arith.muli %7, %c16_i32 : i32 2026-02-21T09:37:48.3960796Z %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.3961338Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : 
tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.3961832Z %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.3962247Z %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.3962865Z %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.3963278Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.3963599Z %16 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:37:48.3963970Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.3964493Z %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:48.3964888Z %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.3965176Z %20 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:48.3965531Z %21 = arith.addi %19, %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.3965819Z %22 = arith.addi %20, %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:48.3966165Z %23 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3966594Z %24 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.3966945Z %25 = arith.muli %24, %cst_2 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.3967210Z %26 = tt.broadcast %25 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3967511Z %27 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.3967881Z %28 = tt.expand_dims %22 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim 
= 0, parent = #blocked2}>> -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3968256Z %29 = tt.splat %arg1 : !tt.ptr -> tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.3968638Z %30 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:48.3969226Z %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:48.3969787Z %32 = tt.expand_dims %31 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.3970139Z %33 = arith.cmpi eq, %32, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.3970414Z %34 = tt.broadcast %33 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:48.3970686Z %35 = arith.cmpi eq, %32, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.3970953Z %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:48.3971321Z %37 = scf.for %arg3 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg4 = %cst_3) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:48.3971676Z %72 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:37:48.3971915Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3972218Z %74 = arith.addi %73, %23 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3972593Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.3972969Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3973235Z %77 = arith.addi %26, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3973534Z %78 = tt.addptr %27, %77 : tensor<16x2x!tt.ptr, 
#blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3973820Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.3974190Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3974700Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3975014Z %82 = arith.muli %arg3, %c7168_i32 : i32 2026-02-21T09:37:48.3975169Z %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3975336Z %84 = arith.addi %83, %28 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3975544Z %85 = tt.addptr %29, %84 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3975757Z %86 = tt.load %85 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.3976050Z %87 = ttg.convert_layout %86 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3976349Z %88 = arith.shli %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3976602Z %89 = arith.shrsi %88, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3976851Z %90 = arith.shrsi %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3977161Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.3977521Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.3977820Z %93 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3978073Z %94 = arith.select %34, %93, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 
2026-02-21T09:37:48.3978325Z %95 = tt.broadcast %92 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3978573Z %96 = arith.select %36, %95, %94 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3978815Z %97 = tt.reshape %96 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.3979049Z %98 = arith.sitofp %97 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.3979362Z %99 = ttg.convert_layout %98 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3979871Z %100 = tt.dot %81, %99, %arg4, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.3980250Z %101 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:37:48.3980388Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:48.3980573Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3980866Z %104 = arith.addi %103, %23 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3981168Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.3981474Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3981687Z %107 = arith.addi %26, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3981905Z %108 = tt.addptr %27, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3982133Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.3982426Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3982883Z 
%111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3983198Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:48.3983352Z %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3983525Z %114 = arith.addi %113, %28 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3983736Z %115 = tt.addptr %29, %114 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3983956Z %116 = tt.load %115 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.3984221Z %117 = ttg.convert_layout %116 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3984526Z %118 = arith.shli %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3984797Z %119 = arith.shrsi %118, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3985027Z %120 = arith.shrsi %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3985312Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.3985640Z %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.3985918Z %123 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3986155Z %124 = arith.select %34, %123, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3986390Z %125 = tt.broadcast %122 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3986623Z %126 = arith.select %36, %125, %124 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3986849Z %127 = tt.reshape %126 : 
tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.3987065Z %128 = arith.sitofp %127 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.3987356Z %129 = ttg.convert_layout %128 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3987810Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.3988148Z %131 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:37:48.3988269Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:48.3988435Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3988658Z %134 = arith.addi %133, %23 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3988930Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.3989252Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3989441Z %137 = arith.addi %26, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3989635Z %138 = tt.addptr %27, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3989842Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.3990102Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3990498Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3990776Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:48.3990915Z %143 = tt.splat 
%142 : i32 -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3991070Z %144 = arith.addi %143, %28 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3991259Z %145 = tt.addptr %29, %144 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3991456Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.3991692Z %147 = ttg.convert_layout %146 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3991968Z %148 = arith.shli %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3992201Z %149 = arith.shrsi %148, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3992465Z %150 = arith.shrsi %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3992748Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.3993081Z %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.3993359Z %153 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3993594Z %154 = arith.select %34, %153, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3993824Z %155 = tt.broadcast %152 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3994054Z %156 = arith.select %36, %155, %154 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.3994279Z %157 = tt.reshape %156 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.3994495Z %158 = arith.sitofp %157 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.3994784Z %159 = ttg.convert_layout %158 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3995242Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.3995582Z scf.yield %160 : tensor<16x64xf32, #mma> 2026-02-21T09:37:48.3995700Z } {tt.flatten} 2026-02-21T09:37:48.3995843Z %38 = arith.addi %23, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.3996117Z %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.3996380Z %40 = tt.broadcast %39 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3996572Z %41 = arith.addi %26, %40 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3996764Z %42 = tt.addptr %27, %41 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.3997003Z %43 = tt.load %42 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.3997257Z %44 = ttg.convert_layout %43 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3997640Z %45 = arith.extf %44 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.3997932Z %46 = arith.addi %28, %cst_5 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3998127Z %47 = tt.addptr %29, %46 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.3998315Z %48 = tt.load %47 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.3998547Z %49 = ttg.convert_layout %48 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3998821Z %50 = arith.shli %49, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:37:48.3999046Z %51 = arith.shrsi %50, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3999268Z %52 = arith.shrsi %49, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.3999540Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.3999863Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.4000129Z %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.4000389Z %56 = arith.select %34, %55, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.4000615Z %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.4000835Z %58 = arith.select %36, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.4001051Z %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.4001259Z %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.4001540Z %61 = ttg.convert_layout %60 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.4001982Z %62 = tt.dot %45, %61, %37, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.4002348Z %63 = arith.truncf %62 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:48.4002661Z %64 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:48.4002892Z %65 = arith.muli %64, %cst : 
tensor<16x1xi32, #mma> 2026-02-21T09:37:48.4003112Z %66 = tt.expand_dims %21 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:48.4003360Z %67 = tt.broadcast %65 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:48.4003552Z %68 = tt.broadcast %66 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:48.4003723Z %69 = arith.addi %67, %68 : tensor<16x64xi32, #mma> 2026-02-21T09:37:48.4003888Z %70 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:48.4004093Z %71 = tt.addptr %70, %69 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:48.4004278Z tt.store %71, %63 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:48.4004406Z tt.return 2026-02-21T09:37:48.4004487Z } 2026-02-21T09:37:48.4004561Z } 2026-02-21T09:37:48.4004644Z 2026-02-21T09:37:48.4004678Z {-# 2026-02-21T09:37:48.4004758Z external_resources: { 2026-02-21T09:37:48.4004857Z mlir_reproducer: { 2026-02-21T09:37:48.4005842Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:48.4006828Z disable_threading: false, 2026-02-21T09:37:48.4006937Z verify_each: true 2026-02-21T09:37:48.4007024Z } 2026-02-21T09:37:48.4007099Z } 2026-02-21T09:37:48.4007167Z #-} 2026-02-21T09:37:48.4007448Z /tmp/torchinductor_root/ub/cubzumuvy6exska34ne3anqev2uijxs3c2fdfge754rtqpbtslar.py:12:0: error: Failures have been detected 
while processing an MLIR pass pipeline 2026-02-21T09:37:48.4008130Z /tmp/torchinductor_root/ub/cubzumuvy6exska34ne3anqev2uijxs3c2fdfge754rtqpbtslar.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:48.4008677Z [104s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:48.4009430Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:48.4010080Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:48.4010246Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:48.5875554Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:37:48.5878819Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:37:48.5879749Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:48.5880599Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:48.5881380Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:48.5882275Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:48.5883691Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:48.5884685Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:48.5885034Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.5885299Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.5885564Z %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.5885797Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:48.5886029Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:48.5886473Z %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5886860Z %cst_5 = arith.constant dense<29352960> : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5887094Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:48.5887276Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:48.5887455Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:48.5887634Z %c112_i32 = arith.constant 112 : i32 
2026-02-21T09:37:48.5887809Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:48.5887977Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:48.5888148Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:48.5888323Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:48.5888539Z %cst_6 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5888755Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:48.5888920Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:48.5889083Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:48.5889363Z %cst_7 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5889644Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:48.5889924Z %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.5890333Z %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:48.5890739Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.5891135Z %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.5891599Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5891957Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.5892258Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.5892663Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:48.5893282Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = 
#blocked}>> 2026-02-21T09:37:48.5893884Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.5894259Z %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.5894517Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:48.5894740Z %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.5894959Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:48.5895196Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:48.5895436Z %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5895755Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.5896064Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5896287Z scf.for %arg3 = %0 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:48.5896457Z %19 = arith.divsi %arg3, %c16_i32 : i32 2026-02-21T09:37:48.5896595Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:48.5896737Z %21 = arith.subi %c112_i32, %20 : i32 2026-02-21T09:37:48.5896869Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:48.5897005Z %23 = arith.remsi %arg3, %c16_i32 : i32 2026-02-21T09:37:48.5897178Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:48.5897310Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:48.5897435Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:48.5897567Z %27 = arith.muli %25, %c64_i32 : i32 2026-02-21T09:37:48.5897752Z %28 = tt.splat %27 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.5897998Z %29 = tt.splat %27 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:48.5898245Z %30 = 
arith.addi %28, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.5898485Z %31 = arith.addi %29, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:48.5898680Z %32 = arith.muli %26, %c16_i32 : i32 2026-02-21T09:37:48.5898871Z %33 = tt.splat %32 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.5899111Z %34 = tt.splat %32 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.5899356Z %35 = arith.addi %33, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.5899616Z %36 = arith.addi %34, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.5899922Z %37 = tt.expand_dims %35 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.5900211Z %38 = arith.muli %37, %cst_2 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.5900436Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5900751Z %40 = tt.expand_dims %31 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5901165Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_3) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:48.5901422Z %72 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:48.5901620Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5901875Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5902185Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.5902499Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5902716Z %77 = arith.addi %39, %76 : 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5902946Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5903182Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.5903487Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5903959Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5904288Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:48.5904452Z %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5904609Z %84 = arith.addi %83, %40 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5904798Z %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5904993Z %86 = tt.load %85 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.5905229Z %87 = ttg.convert_layout %86 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5905509Z %88 = arith.shli %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5905772Z %89 = arith.shrsi %88, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5906000Z %90 = arith.shrsi %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5906282Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5906607Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5906882Z %93 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5907112Z %94 = 
arith.select %12, %93, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5907342Z %95 = tt.broadcast %92 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5907566Z %96 = arith.select %14, %95, %94 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5907794Z %97 = tt.reshape %96 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.5908010Z %98 = arith.sitofp %97 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.5908296Z %99 = ttg.convert_layout %98 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5908756Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.5909101Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:48.5909224Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:48.5909426Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5909656Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5909932Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.5910209Z %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5910401Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5910600Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5910806Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.5911071Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> 
tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5911468Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5911752Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:48.5911896Z %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5912054Z %114 = arith.addi %113, %40 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5912245Z %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5912442Z %116 = tt.load %115 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.5912680Z %117 = ttg.convert_layout %116 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5912961Z %118 = arith.shli %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5913199Z %119 = arith.shrsi %118, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5913463Z %120 = arith.shrsi %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5913753Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5914084Z %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5914370Z %123 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5914608Z %124 = arith.select %12, %123, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5914843Z %125 = tt.broadcast %122 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5915078Z %126 = arith.select %14, %125, %124 : tensor<1x2x64xi1, 
#blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5915308Z %127 = tt.reshape %126 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.5915536Z %128 = arith.sitofp %127 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.5915832Z %129 = ttg.convert_layout %128 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5916298Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.5916644Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:48.5916769Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:48.5916938Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5917196Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.5917471Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.5917749Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5917939Z %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5918140Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5918344Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.5918607Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5919010Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:37:48.5919290Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:48.5919432Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5919590Z %144 = arith.addi %143, %40 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5919781Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5919979Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.5920217Z %147 = ttg.convert_layout %146 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5920495Z %148 = arith.shli %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5920728Z %149 = arith.shrsi %148, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5920963Z %150 = arith.shrsi %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5921251Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5921626Z %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5921908Z %153 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5922145Z %154 = arith.select %12, %153, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5922378Z %155 = tt.broadcast %152 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5922648Z %156 = arith.select %14, %155, %154 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5922877Z %157 = tt.reshape %156 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.5923100Z %158 = arith.sitofp %157 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 
2026-02-21T09:37:48.5923397Z %159 = ttg.convert_layout %158 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5923852Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.5924195Z scf.yield %160 : tensor<16x64xf32, #mma> 2026-02-21T09:37:48.5924315Z } {tt.flatten} 2026-02-21T09:37:48.5924430Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5924624Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.5924865Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.5925122Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5925516Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5925814Z %47 = arith.addi %40, %cst_5 : tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5926014Z %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr, #blocked2>, tensor<1x64xi32, #blocked2> 2026-02-21T09:37:48.5926208Z %49 = tt.load %48 : tensor<1x64x!tt.ptr, #blocked2> 2026-02-21T09:37:48.5926448Z %50 = ttg.convert_layout %49 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5926721Z %51 = arith.shli %50, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5926957Z %52 = arith.shrsi %51, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5927189Z %53 = arith.shrsi %50, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.5927471Z %54 = 
tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5927798Z %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.5928071Z %56 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5928298Z %57 = arith.select %12, %56, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5928525Z %58 = tt.broadcast %55 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5928746Z %59 = arith.select %14, %58, %57 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.5928966Z %60 = tt.reshape %59 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.5929215Z %61 = arith.sitofp %60 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.5929501Z %62 = ttg.convert_layout %61 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.5929950Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.5930326Z %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:48.5930583Z %65 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:48.5930934Z %66 = arith.muli %65, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:48.5931184Z %67 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:48.5931547Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 
2026-02-21T09:37:48.5931759Z %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:48.5931961Z %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma> 2026-02-21T09:37:48.5932182Z %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:48.5942858Z tt.store %71, %64 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:48.5943045Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:37:48.5943188Z tt.return 2026-02-21T09:37:48.5943269Z } 2026-02-21T09:37:48.5943348Z } 2026-02-21T09:37:48.5943392Z 2026-02-21T09:37:48.5943424Z {-# 2026-02-21T09:37:48.5943504Z external_resources: { 2026-02-21T09:37:48.5943666Z mlir_reproducer: { 2026-02-21T09:37:48.5944659Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:48.5945653Z disable_threading: false, 2026-02-21T09:37:48.5945759Z verify_each: true 2026-02-21T09:37:48.5945847Z } 2026-02-21T09:37:48.5945919Z } 2026-02-21T09:37:48.5945991Z #-} 2026-02-21T09:37:48.5946273Z /tmp/torchinductor_root/3g/c3gtwuhxllsqoypqhlqwz2aa4bkm5b5sdacguxpaftodwwlmfw7h.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:48.5946965Z /tmp/torchinductor_root/3g/c3gtwuhxllsqoypqhlqwz2aa4bkm5b5sdacguxpaftodwwlmfw7h.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at 
`std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:48.5947522Z [104s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:48.5948311Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:48.5949020Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:48.5949188Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:48.7794514Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:37:48.7798084Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:37:48.7799033Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:48.7799868Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:48.7800652Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:48.7801370Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:48.7801870Z #smem = #ttg.shared_memory 2026-02-21T09:37:48.7802520Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:48.7803937Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:48.7805029Z %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:48.7805396Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:48.7805639Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:48.7805821Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:48.7805958Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:48.7806135Z %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7806445Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:48.7806593Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:48.7806735Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:48.7806876Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:48.7807015Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:48.7807158Z %c4095_i32 = arith.constant 4095 : i32 
2026-02-21T09:37:48.7807301Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:48.7807536Z %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7807871Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7808143Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.7808415Z %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7808689Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.7808899Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.7809112Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:48.7809291Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:48.7809536Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.7809879Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.7810262Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:48.7810648Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.7810974Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7811280Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.7811642Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7812016Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 
2026-02-21T09:37:48.7812531Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:48.7813024Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.7813340Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.7813582Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:48.7813822Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:48.7814056Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:48.7814312Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:48.7814573Z %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7814912Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.7815242Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7815465Z scf.for %arg3 = %0 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:48.7815613Z %19 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:37:48.7815738Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:48.7815854Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:48.7816002Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:48.7816122Z %23 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:37:48.7816241Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:48.7816355Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:48.7816465Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:48.7816578Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:48.7816746Z %28 = tt.splat %27 : i32 -> 
tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.7816957Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.7817167Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:48.7817373Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:48.7817535Z %32 = arith.muli %26, %c64_i32 : i32 2026-02-21T09:37:48.7817739Z %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:48.7817988Z %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.7818236Z %35 = arith.addi %33, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:48.7818480Z %36 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:48.7818743Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.7818989Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:48.7819179Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7819546Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7819937Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:48.7820184Z %72 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:48.7820356Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7820576Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:48.7820844Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.7821117Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7821307Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7821510Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7821710Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.7821977Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7822383Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7822664Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:48.7822842Z %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7823067Z %84 = arith.addi %83, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7823366Z %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7823699Z %86 = tt.load %85 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7823929Z %87 = arith.shli %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7824162Z %88 = arith.shrsi %87, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7824393Z %89 = arith.shrsi %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7824674Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7825000Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7825274Z %92 = tt.broadcast %90 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7825509Z %93 = arith.select %12, %92, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7825740Z %94 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7825970Z %95 = arith.select %14, %94, %93 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7826190Z %96 = tt.reshape %95 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.7826403Z %97 = arith.sitofp %96 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.7826648Z %98 = ttg.local_alloc %97 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:48.7826965Z %99 = ttg.local_load %98 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7827429Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.7827774Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:48.7827927Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:48.7828098Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7828322Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7828594Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.7828870Z %106 = 
tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7829063Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7829264Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7829477Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.7829742Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7830151Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7830432Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:48.7830610Z %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7830839Z %114 = arith.addi %113, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7831146Z %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7831483Z %116 = tt.load %115 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7831712Z %117 = arith.shli %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7831950Z %118 = arith.shrsi %117, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7832186Z %119 = arith.shrsi %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7832473Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7832810Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7833092Z %122 = tt.broadcast %120 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7833329Z %123 = arith.select %12, %122, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7833566Z %124 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7833799Z %125 = arith.select %14, %124, %123 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7834034Z %126 = tt.reshape %125 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.7834256Z %127 = arith.sitofp %126 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.7834506Z %128 = ttg.local_alloc %127 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:48.7834832Z %129 = ttg.local_load %128 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7835302Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.7835678Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:48.7835802Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:48.7835970Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7836192Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:48.7836465Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:48.7836742Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7836937Z %137 = arith.addi %39, %136 : 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7837134Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7837343Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.7837606Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7838013Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7838295Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:48.7838467Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7838695Z %144 = arith.addi %143, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7839000Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7839357Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7839584Z %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7839820Z %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7840053Z %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7840336Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7840667Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7840943Z %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 
2026-02-21T09:37:48.7841184Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7841419Z %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7841647Z %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7841871Z %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.7842087Z %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.7842335Z %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:48.7842703Z %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7843168Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.7843549Z scf.yield %160 : tensor<16x64xf32, #mma> 2026-02-21T09:37:48.7843675Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:37:48.7843814Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7844005Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:48.7844199Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:48.7844453Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7844837Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:48.7845161Z %47 = arith.addi %40, %cst_1 : 
tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7845466Z %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7845768Z %49 = tt.load %48 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7845990Z %50 = arith.shli %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7846215Z %51 = arith.shrsi %50, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7846438Z %52 = arith.shrsi %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:48.7846714Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7847070Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:48.7847341Z %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7847573Z %56 = arith.select %12, %55, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7847798Z %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7848019Z %58 = arith.select %14, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:48.7848232Z %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:48.7848441Z %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:48.7848680Z %61 = ttg.local_alloc %60 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:48.7848992Z %62 = ttg.local_load %61 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:37:48.7849445Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:48.7849815Z %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:48.7850069Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:48.7850302Z %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:48.7850526Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:48.7850773Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:48.7850967Z %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:48.7851138Z %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma> 2026-02-21T09:37:48.7851348Z %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:48.7851533Z tt.store %71, %64 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:48.7851692Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:37:48.7851827Z tt.return 2026-02-21T09:37:48.7851904Z } 2026-02-21T09:37:48.7851972Z } 2026-02-21T09:37:48.7852016Z 2026-02-21T09:37:48.7852045Z {-# 2026-02-21T09:37:48.7852121Z external_resources: { 2026-02-21T09:37:48.7852220Z mlir_reproducer: { 2026-02-21T09:37:48.7853215Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, 
convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:48.7854204Z disable_threading: false, 2026-02-21T09:37:48.7854306Z verify_each: true 2026-02-21T09:37:48.7854395Z } 2026-02-21T09:37:48.7854464Z } 2026-02-21T09:37:48.7854530Z #-} 2026-02-21T09:37:48.7854805Z /tmp/torchinductor_root/et/cetnhmlxe7uqgvdferbhtte42dogre5ejn2acagrtopbdcr5d64g.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:48.7855594Z /tmp/torchinductor_root/et/cetnhmlxe7uqgvdferbhtte42dogre5ejn2acagrtopbdcr5d64g.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:48.7856144Z [104s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:48.7856923Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:48.7857631Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:48.7857798Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:37:49.0188011Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:49.0191618Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:37:49.0192589Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:49.0193432Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:49.0194265Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:49.0195020Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:49.0195752Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:49.0196832Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:49.0198051Z %cst = arith.constant dense<7168> : tensor<1x64xi64, #mma> 2026-02-21T09:37:49.0198458Z %cst_0 = arith.constant dense<0> : tensor<1x64xi64, #mma> 2026-02-21T09:37:49.0198820Z %cst_1 = arith.constant dense<64> : tensor<16x1xi64, #mma> 2026-02-21T09:37:49.0199176Z %cst_2 = arith.constant dense<0> : tensor<16x1xi64, #mma> 2026-02-21T09:37:49.0199541Z %cst_3 = arith.constant dense<7168> : tensor<16x1xi64, #mma> 2026-02-21T09:37:49.0199933Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.0200316Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 
2026-02-21T09:37:49.0200716Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.0201128Z %cst_7 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0201527Z %cst_8 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:49.0201859Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:49.0202123Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:49.0202367Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:49.0202698Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:49.0202941Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:49.0203203Z %c4092_i32 = arith.constant 4092 : i32 2026-02-21T09:37:49.0203454Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:37:49.0203758Z %cst_9 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0204081Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:37:49.0204330Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:49.0204541Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:49.0204906Z %cst_10 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0205289Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:49.0205596Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:49.0206032Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.0206449Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.0206859Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.0207270Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0207682Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0208058Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.0208372Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.0208791Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:49.0209455Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:49.0210085Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.0210482Z %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.0210791Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:37:49.0211094Z %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.0211390Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:37:49.0211713Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.0212188Z %17 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.0212704Z %18 = arith.extsi %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.0213094Z scf.for %arg3 = %0 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:49.0213323Z %19 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:37:49.0213513Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:49.0213694Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:49.0213874Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:49.0214062Z 
%23 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:37:49.0214243Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:49.0214415Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:49.0214566Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:49.0214721Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:49.0214924Z %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:49.0215209Z %29 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:49.0215411Z %30 = arith.muli %26, %c64_i32 : i32 2026-02-21T09:37:49.0215604Z %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.0215859Z %32 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.0216182Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:37:49.0216476Z %34 = arith.muli %33, %cst_8 : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:49.0216753Z %35 = tt.broadcast %34 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0217082Z %36 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:37:49.0217410Z %37 = tt.broadcast %36 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0217726Z %38 = scf.for %arg4 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg5 = %cst_6) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.0218048Z %63 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0218312Z %64 = arith.addi %63, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0218515Z %65 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.0218714Z %66 = tt.splat %65 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = 
#blocked2}>> 2026-02-21T09:37:49.0218970Z %67 = arith.addi %66, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0219291Z %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.0219619Z %69 = tt.broadcast %68 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0219846Z %70 = arith.addi %35, %69 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0220085Z %71 = tt.addptr %7, %70 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0220323Z %72 = tt.load %71 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.0220640Z %73 = ttg.convert_layout %72 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0221129Z %74 = arith.extf %73 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0221578Z %75 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0221908Z %76 = arith.muli %75, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0222130Z %77 = tt.broadcast %76 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0222354Z %78 = arith.addi %77, %37 : tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0222583Z %79 = tt.addptr %8, %78 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0222811Z %80 = tt.load %79 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.0223094Z %81 = ttg.convert_layout %80 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0223423Z %82 = arith.shli %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0223703Z %83 = arith.shrsi %82, 
%cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0223983Z %84 = arith.shrsi %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0224321Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0224665Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0224937Z %87 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0225167Z %88 = arith.select %13, %87, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0225398Z %89 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0225652Z %90 = arith.select %15, %89, %88 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0225871Z %91 = tt.reshape %90 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.0226087Z %92 = arith.sitofp %91 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.0226374Z %93 = ttg.convert_layout %92 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0226837Z %94 = tt.dot %74, %93, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.0227177Z %95 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:49.0227345Z %96 = tt.splat %95 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0227558Z %97 = arith.addi %96, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0227727Z %98 = arith.muli %95, %c2_i32 : i32 
2026-02-21T09:37:49.0227891Z %99 = tt.splat %98 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0228101Z %100 = arith.addi %99, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0228376Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.0228650Z %102 = tt.broadcast %101 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0228842Z %103 = arith.addi %35, %102 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0229039Z %104 = tt.addptr %7, %103 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0229240Z %105 = tt.load %104 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.0229505Z %106 = ttg.convert_layout %105 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0229937Z %107 = arith.extf %106 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0230315Z %108 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0230560Z %109 = arith.muli %108, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0230747Z %110 = tt.broadcast %109 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0230936Z %111 = arith.addi %110, %37 : tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0231128Z %112 = tt.addptr %8, %111 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0231323Z %113 = tt.load %112 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.0231563Z %114 = ttg.convert_layout %113 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0231843Z %115 = 
arith.shli %114, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0232077Z %116 = arith.shrsi %115, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0232310Z %117 = arith.shrsi %114, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0232600Z %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0232937Z %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0233214Z %120 = tt.broadcast %118 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0233494Z %121 = arith.select %13, %120, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0233732Z %122 = tt.broadcast %119 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0233966Z %123 = arith.select %15, %122, %121 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0234196Z %124 = tt.reshape %123 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.0234414Z %125 = arith.sitofp %124 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.0234708Z %126 = ttg.convert_layout %125 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0235167Z %127 = tt.dot %107, %126, %94, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.0235505Z %128 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:37:49.0235679Z %129 = tt.splat %128 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0235898Z %130 = 
arith.addi %129, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0236070Z %131 = arith.muli %128, %c2_i32 : i32 2026-02-21T09:37:49.0236238Z %132 = tt.splat %131 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0236461Z %133 = arith.addi %132, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0236737Z %134 = tt.expand_dims %133 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.0237012Z %135 = tt.broadcast %134 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0237208Z %136 = arith.addi %35, %135 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0237401Z %137 = tt.addptr %7, %136 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0237730Z %138 = tt.load %137 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.0237993Z %139 = ttg.convert_layout %138 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0238387Z %140 = arith.extf %139 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0238764Z %141 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0239011Z %142 = arith.muli %141, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0239202Z %143 = tt.broadcast %142 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0239393Z %144 = arith.addi %143, %37 : tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0239586Z %145 = tt.addptr %8, %144 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0239782Z %146 = tt.load %145 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.0240019Z %147 = 
ttg.convert_layout %146 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0240297Z %148 = arith.shli %147, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0240533Z %149 = arith.shrsi %148, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0240766Z %150 = arith.shrsi %147, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0241086Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0241419Z %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0241698Z %153 = tt.broadcast %151 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0241934Z %154 = arith.select %13, %153, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0242168Z %155 = tt.broadcast %152 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0242401Z %156 = arith.select %15, %155, %154 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0242679Z %157 = tt.reshape %156 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.0242900Z %158 = arith.sitofp %157 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.0243199Z %159 = ttg.convert_layout %158 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0243667Z %160 = tt.dot %140, %159, %127, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.0244014Z scf.yield %160 : tensor<16x64xf32, #mma> 
2026-02-21T09:37:49.0244134Z } {tt.flatten} 2026-02-21T09:37:49.0244325Z %39 = scf.for %arg4 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %38) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.0244595Z %63 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0244816Z %64 = arith.addi %63, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.0244991Z %65 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.0245157Z %66 = tt.splat %65 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0245367Z %67 = arith.addi %66, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.0245677Z %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.0245945Z %69 = tt.broadcast %68 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0246131Z %70 = arith.addi %35, %69 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0246322Z %71 = tt.addptr %7, %70 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.0246518Z %72 = tt.load %71 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.0246775Z %73 = ttg.convert_layout %72 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0247167Z %74 = arith.extf %73 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0247548Z %75 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0247789Z %76 = arith.muli %75, %cst_7 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.0247973Z %77 = tt.broadcast %76 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 
2026-02-21T09:37:49.0248160Z %78 = arith.addi %77, %37 : tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0248347Z %79 = tt.addptr %8, %78 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.0248540Z %80 = tt.load %79 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.0248775Z %81 = ttg.convert_layout %80 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0249086Z %82 = arith.shli %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0249317Z %83 = arith.shrsi %82, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0249545Z %84 = arith.shrsi %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.0249829Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0250159Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.0250431Z %87 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0250662Z %88 = arith.select %13, %87, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0250890Z %89 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0251111Z %90 = arith.select %15, %89, %88 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.0251333Z %91 = tt.reshape %90 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.0251544Z %92 = arith.sitofp %91 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.0251836Z %93 = ttg.convert_layout %92 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.0252288Z %94 
= tt.dot %74, %93, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.0252631Z scf.yield %94 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.0252762Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:37:49.0252928Z %40 = arith.truncf %39 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:49.0253092Z %41 = arith.extsi %27 : i32 to i64 2026-02-21T09:37:49.0253239Z %42 = arith.extsi %30 : i32 to i64 2026-02-21T09:37:49.0253396Z %43 = tt.splat %41 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.0253599Z %44 = arith.addi %43, %17 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.0253857Z %45 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma> 2026-02-21T09:37:49.0254090Z %46 = arith.muli %45, %cst_3 : tensor<16x1xi64, #mma> 2026-02-21T09:37:49.0254259Z %47 = tt.broadcast %46 : tensor<16x1xi64, #mma> -> tensor<16x64xi64, #mma> 2026-02-21T09:37:49.0254455Z %48 = tt.splat %42 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.0254654Z %49 = arith.addi %48, %18 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.0254907Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi64, #mma> 2026-02-21T09:37:49.0255161Z %51 = tt.broadcast %50 : tensor<1x64xi64, #mma> -> tensor<16x64xi64, #mma> 2026-02-21T09:37:49.0255333Z %52 = arith.addi %47, %51 : tensor<16x64xi64, #mma> 2026-02-21T09:37:49.0255509Z %53 = tt.addptr %16, %52 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi64, #mma> 2026-02-21T09:37:49.0255698Z %54 = arith.cmpi sge, %45, %cst_2 : tensor<16x1xi64, #mma> 2026-02-21T09:37:49.0255857Z %55 = arith.cmpi slt, %45, %cst_1 : tensor<16x1xi64, #mma> 
2026-02-21T09:37:49.0256007Z %56 = arith.andi %54, %55 : tensor<16x1xi1, #mma> 2026-02-21T09:37:49.0256170Z %57 = tt.broadcast %56 : tensor<16x1xi1, #mma> -> tensor<16x64xi1, #mma> 2026-02-21T09:37:49.0256345Z %58 = arith.cmpi sge, %50, %cst_0 : tensor<1x64xi64, #mma> 2026-02-21T09:37:49.0256532Z %59 = arith.cmpi slt, %50, %cst : tensor<1x64xi64, #mma> 2026-02-21T09:37:49.0256678Z %60 = arith.andi %58, %59 : tensor<1x64xi1, #mma> 2026-02-21T09:37:49.0256841Z %61 = tt.broadcast %60 : tensor<1x64xi1, #mma> -> tensor<16x64xi1, #mma> 2026-02-21T09:37:49.0257008Z %62 = arith.andi %57, %61 : tensor<16x64xi1, #mma> 2026-02-21T09:37:49.0257156Z tt.store %53, %40, %62 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.0257315Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:37:49.0257449Z tt.return 2026-02-21T09:37:49.0257528Z } 2026-02-21T09:37:49.0257602Z } 2026-02-21T09:37:49.0257646Z 2026-02-21T09:37:49.0257676Z {-# 2026-02-21T09:37:49.0257755Z external_resources: { 2026-02-21T09:37:49.0257851Z mlir_reproducer: { 2026-02-21T09:37:49.0258846Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:49.0259837Z disable_threading: false, 2026-02-21T09:37:49.0259942Z verify_each: true 2026-02-21T09:37:49.0260030Z } 2026-02-21T09:37:49.0260101Z } 2026-02-21T09:37:49.0260169Z #-} 2026-02-21T09:37:49.0260446Z 
/tmp/torchinductor_root/wo/cwoivxvx4gcd3oojv5o2u53akbi3va4uyq3akf2pkujxur7bsnzy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:49.0261127Z /tmp/torchinductor_root/wo/cwoivxvx4gcd3oojv5o2u53akbi3va4uyq3akf2pkujxur7bsnzy.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:49.0261708Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:49.0262485Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:49.0263190Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:49.0263353Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:49.2710205Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:37:49.2717123Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:37:49.2717691Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:37:49.2718207Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:37:49.2718686Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:49.2719109Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:49.2719418Z #smem = #ttg.shared_memory 2026-02-21T09:37:49.2724124Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:49.2724867Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:49.2725431Z %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2725663Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:49.2725835Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:49.2726004Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:49.2726161Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:49.2726364Z %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2726575Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:49.2726744Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:49.2726902Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:49.2727063Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:49.2727228Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:49.2727393Z %c4095_i32 = arith.constant 4095 : i32 
2026-02-21T09:37:49.2727559Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:49.2727715Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T09:37:49.2727994Z %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2728380Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2728652Z %c2879_i32 = arith.constant 2879 : i32 2026-02-21T09:37:49.2728861Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.2729166Z %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2729474Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.2729715Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.2729962Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.2730167Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:49.2730508Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.2730901Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.2731339Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.2731776Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.2732152Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2732494Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2732837Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2733277Z %8 = tt.make_range 
{end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:49.2733873Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:49.2734428Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.2734710Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.2734960Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:49.2735175Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.2735422Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:49.2735652Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.2735835Z %16 = arith.subi %c2879_i32, %0 : i32 2026-02-21T09:37:49.2735970Z %17 = arith.divui %16, %c2432_i32 : i32 2026-02-21T09:37:49.2736098Z %18 = arith.remsi %17, %c2_i32 : i32 2026-02-21T09:37:49.2736225Z %19 = arith.subi %17, %18 : i32 2026-02-21T09:37:49.2736355Z %20 = arith.muli %19, %c2432_i32 : i32 2026-02-21T09:37:49.2736485Z %21 = arith.addi %0, %20 : i32 2026-02-21T09:37:49.2736664Z %22 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2736973Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2737275Z %24 = tt.broadcast %23 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2737485Z scf.for %arg3 = %0 to %21 step %c4864_i32 : i32 { 2026-02-21T09:37:49.2737639Z %25 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:37:49.2737774Z %26 = arith.muli %25, %c4_i32 : i32 
2026-02-21T09:37:49.2737902Z %27 = arith.subi %c4_i32, %26 : i32 2026-02-21T09:37:49.2738027Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:37:49.2738160Z %29 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:37:49.2738291Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:37:49.2738413Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:37:49.2738537Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:37:49.2738661Z %33 = arith.muli %31, %c16_i32 : i32 2026-02-21T09:37:49.2738848Z %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.2739080Z %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.2739317Z %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.2739550Z %37 = arith.addi %35, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.2739772Z %38 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:37:49.2739997Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.2740270Z %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.2740546Z %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.2740819Z %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.2741111Z %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.2741391Z %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.2741604Z %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2741999Z %46 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = 
#blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2742436Z %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.2742675Z %132 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.2742868Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2743120Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2743433Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2743786Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2744012Z %137 = arith.addi %45, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2744240Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2744456Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2744727Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2745132Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2745414Z %142 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:49.2745597Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2745830Z %144 = arith.addi %143, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2746145Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2746462Z %146 = tt.load %145 : 
tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2746695Z %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2746938Z %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2747178Z %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2747478Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2747825Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2748143Z %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2748386Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2748629Z %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2748863Z %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2749098Z %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2749320Z %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2749576Z %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2749906Z %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2750384Z %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2750738Z %161 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:49.2750866Z %162 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:37:49.2751044Z %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2751273Z %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2751549Z %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2751878Z %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2752077Z %167 = arith.addi %45, %166 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2752284Z %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2752496Z %169 = tt.load %168 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2752763Z %170 = ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2753168Z %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2753453Z %172 = arith.muli %161, %c7168_i32 : i32 2026-02-21T09:37:49.2753634Z %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2753869Z %174 = arith.addi %173, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2754178Z %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2754491Z %176 = tt.load %175 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2754723Z %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2754965Z %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2755205Z %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2755493Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2755833Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2756153Z %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2756397Z %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2756639Z %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2756871Z %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2757105Z %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2757327Z %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2757581Z %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2757911Z %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2758382Z %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2758731Z %191 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:49.2758860Z %192 = arith.muli %191, 
%c2_i32 : i32 2026-02-21T09:37:49.2759032Z %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2759263Z %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2759546Z %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2759867Z %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2760066Z %197 = arith.addi %45, %196 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2760266Z %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2760478Z %199 = tt.load %198 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2760744Z %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2761149Z %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2761438Z %202 = arith.muli %191, %c7168_i32 : i32 2026-02-21T09:37:49.2761614Z %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2761849Z %204 = arith.addi %203, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2762157Z %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2762472Z %206 = tt.load %205 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2762754Z %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2762989Z %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent 
= #blocked}>> 2026-02-21T09:37:49.2763229Z %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2763518Z %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2763860Z %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2764182Z %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2764419Z %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2764662Z %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2764893Z %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2765127Z %216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2765353Z %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2765606Z %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2765936Z %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2766406Z %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2766756Z scf.yield %220 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2766883Z } {tt.flatten} 2026-02-21T09:37:49.2767004Z %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2767204Z %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr, 
#blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2767405Z %50 = tt.load %49 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2767706Z %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2768104Z %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2768439Z %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2768752Z %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2769054Z %55 = tt.load %54 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2769285Z %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2769521Z %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2769753Z %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2770042Z %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2770377Z %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2770653Z %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2770891Z %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2771119Z %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2771344Z %64 = arith.select %14, %63, %62 : 
tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2771569Z %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2771785Z %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2772032Z %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2772379Z %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2772837Z %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2773215Z %70 = arith.truncf %69 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:49.2773472Z %71 = tt.expand_dims %37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:49.2773711Z %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.2773942Z %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:49.2774201Z %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2774402Z %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2774577Z %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2774762Z %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2774951Z tt.store %77, %70 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.2775097Z %78 = arith.addi %arg3, %c2432_i32 : i32 2026-02-21T09:37:49.2775222Z %79 = arith.divsi %78, %c448_i32 : i32 2026-02-21T09:37:49.2775346Z %80 = arith.muli %79, %c4_i32 : i32 
2026-02-21T09:37:49.2775468Z %81 = arith.subi %c4_i32, %80 : i32 2026-02-21T09:37:49.2775586Z %82 = arith.minsi %81, %c4_i32 : i32 2026-02-21T09:37:49.2775745Z %83 = arith.remsi %78, %c448_i32 : i32 2026-02-21T09:37:49.2775864Z %84 = arith.remsi %83, %82 : i32 2026-02-21T09:37:49.2775989Z %85 = arith.addi %80, %84 : i32 2026-02-21T09:37:49.2776104Z %86 = arith.divsi %83, %82 : i32 2026-02-21T09:37:49.2776222Z %87 = arith.muli %85, %c16_i32 : i32 2026-02-21T09:37:49.2776389Z %88 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.2776605Z %89 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.2776820Z %90 = arith.addi %88, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.2777030Z %91 = arith.addi %89, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.2777197Z %92 = arith.muli %86, %c64_i32 : i32 2026-02-21T09:37:49.2777400Z %93 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.2777654Z %94 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.2777908Z %95 = arith.addi %93, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.2778154Z %96 = arith.addi %94, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.2778420Z %97 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.2778665Z %98 = arith.muli %97, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.2778858Z %99 = tt.broadcast %98 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2779212Z %100 = tt.expand_dims %95 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 
-> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2779606Z %101 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.2779861Z %132 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.2780038Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2780269Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2780553Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2780832Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2781033Z %137 = arith.addi %99, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2781236Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2781449Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2781724Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2782133Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2782425Z %142 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:49.2782605Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2782839Z %144 = arith.addi %143, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2783154Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2783499Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2783736Z %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2783977Z %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2784213Z %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2784507Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2784840Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2785127Z %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2785370Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2785605Z %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2785837Z %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2786068Z %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2786291Z %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2786545Z %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2786864Z %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2787335Z %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 
2026-02-21T09:37:49.2787682Z %161 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:49.2787838Z %162 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:37:49.2795709Z %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2795947Z %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2796229Z %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2796510Z %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2796705Z %167 = arith.addi %99, %166 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2796905Z %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2797112Z %169 = tt.load %168 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2797384Z %170 = ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2797794Z %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2798079Z %172 = arith.muli %161, %c7168_i32 : i32 2026-02-21T09:37:49.2798256Z %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2798486Z %174 = arith.addi %173, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2798800Z %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2799169Z %176 = tt.load %175 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2799402Z %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent 
= #blocked}>> 2026-02-21T09:37:49.2799641Z %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2799876Z %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2800167Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2800506Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2800789Z %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2801034Z %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2801272Z %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2801505Z %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2801734Z %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2801954Z %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2802204Z %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2802525Z %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2803043Z %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2803389Z %191 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:49.2803550Z %192 = arith.muli %191, %c2_i32 : i32 
2026-02-21T09:37:49.2803722Z %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2803942Z %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2804215Z %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2804490Z %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2804684Z %197 = arith.addi %99, %196 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2804880Z %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2805087Z %199 = tt.load %198 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2805353Z %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2805759Z %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2806042Z %202 = arith.muli %191, %c7168_i32 : i32 2026-02-21T09:37:49.2806217Z %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2806445Z %204 = arith.addi %203, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2806756Z %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2807100Z %206 = tt.load %205 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2807330Z %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2807568Z %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:37:49.2807805Z %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2808093Z %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2808429Z %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2808712Z %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2808952Z %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2809187Z %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2809418Z %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2809645Z %216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2809864Z %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2810113Z %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2810438Z %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2810909Z %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2811259Z scf.yield %220 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2811433Z } {tt.flatten} 2026-02-21T09:37:49.2811551Z %102 = arith.addi %99, %24 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2811754Z %103 = tt.addptr %6, %102 : tensor<16x2x!tt.ptr, #blocked1>, 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2811959Z %104 = tt.load %103 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2812225Z %105 = ttg.convert_layout %104 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2812626Z %106 = arith.extf %105 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2812959Z %107 = arith.addi %100, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2813274Z %108 = tt.addptr %7, %107 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2813585Z %109 = tt.load %108 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2813814Z %110 = arith.shli %109, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2814046Z %111 = arith.shrsi %110, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2814276Z %112 = arith.shrsi %109, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2814560Z %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2814929Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2815210Z %115 = tt.broadcast %113 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2815448Z %116 = arith.select %12, %115, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2815680Z %117 = tt.broadcast %114 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2815912Z %118 = arith.select %14, %117, %116 
: tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2816137Z %119 = tt.reshape %118 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2816358Z %120 = arith.sitofp %119 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2816606Z %121 = ttg.local_alloc %120 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2816926Z %122 = ttg.local_load %121 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2817393Z %123 = tt.dot %106, %122, %101, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2817777Z %124 = arith.truncf %123 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:49.2818041Z %125 = tt.expand_dims %91 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:49.2818279Z %126 = arith.muli %125, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.2818506Z %127 = tt.expand_dims %96 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:49.2818761Z %128 = tt.broadcast %126 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2818963Z %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2819137Z %130 = arith.addi %128, %129 : tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2819357Z %131 = tt.addptr %15, %130 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2819547Z tt.store %131, %124 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.2819683Z } {tt.num_stages = 1 : i32} 2026-02-21T09:37:49.2819813Z scf.for %arg3 = %21 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:49.2819955Z 
%25 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:37:49.2820076Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:37:49.2820190Z %27 = arith.subi %c4_i32, %26 : i32 2026-02-21T09:37:49.2820304Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:37:49.2820424Z %29 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:37:49.2820541Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:37:49.2820650Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:37:49.2820762Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:37:49.2820871Z %33 = arith.muli %31, %c16_i32 : i32 2026-02-21T09:37:49.2821041Z %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.2821253Z %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.2821460Z %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.2821667Z %37 = arith.addi %35, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.2821826Z %38 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:37:49.2822030Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.2822279Z %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.2822692Z %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.2822938Z %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.2823204Z %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.2823454Z %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.2823644Z %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2823993Z %46 = tt.expand_dims %41 
{axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2824383Z %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.2824597Z %78 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.2824768Z %79 = tt.splat %78 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2824988Z %80 = arith.addi %79, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2825260Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2825532Z %82 = tt.broadcast %81 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2825721Z %83 = arith.addi %45, %82 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2825917Z %84 = tt.addptr %6, %83 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2826119Z %85 = tt.load %84 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2826379Z %86 = ttg.convert_layout %85 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2826780Z %87 = arith.extf %86 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2827091Z %88 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:49.2827266Z %89 = tt.splat %88 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2827489Z %90 = arith.addi %89, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2827792Z %91 = tt.addptr %7, %90 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:37:49.2828097Z %92 = tt.load %91 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2828321Z %93 = arith.shli %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2828554Z %94 = arith.shrsi %93, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2828785Z %95 = arith.shrsi %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2829072Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2829403Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2829678Z %98 = tt.broadcast %96 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2829908Z %99 = arith.select %12, %98, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2830139Z %100 = tt.broadcast %97 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2830382Z %101 = arith.select %14, %100, %99 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2830638Z %102 = tt.reshape %101 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2830861Z %103 = arith.sitofp %102 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2831109Z %104 = ttg.local_alloc %103 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2831429Z %105 = ttg.local_load %104 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2831896Z %106 = tt.dot %87, %105, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 
1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2832241Z %107 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:49.2832363Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:37:49.2832534Z %109 = tt.splat %108 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2832757Z %110 = arith.addi %109, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2833036Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2833310Z %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2833503Z %113 = arith.addi %45, %112 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2833699Z %114 = tt.addptr %6, %113 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2833906Z %115 = tt.load %114 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2834173Z %116 = ttg.convert_layout %115 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2834575Z %117 = arith.extf %116 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2834891Z %118 = arith.muli %107, %c7168_i32 : i32 2026-02-21T09:37:49.2835065Z %119 = tt.splat %118 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2835293Z %120 = arith.addi %119, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2835601Z %121 = tt.addptr %7, %120 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2835906Z %122 = tt.load %121 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2836139Z %123 = arith.shli 
%122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2836376Z %124 = arith.shrsi %123, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2836612Z %125 = arith.shrsi %122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2836904Z %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2837235Z %127 = tt.expand_dims %125 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2837521Z %128 = tt.broadcast %126 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2837757Z %129 = arith.select %12, %128, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2837993Z %130 = tt.broadcast %127 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2838226Z %131 = arith.select %14, %130, %129 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2838487Z %132 = tt.reshape %131 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2838714Z %133 = arith.sitofp %132 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2838964Z %134 = ttg.local_alloc %133 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2839287Z %135 = ttg.local_load %134 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2839757Z %136 = tt.dot %117, %135, %106, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2840102Z %137 = arith.addi %arg4, %c2_i32 : i32 
2026-02-21T09:37:49.2840226Z %138 = arith.muli %137, %c2_i32 : i32 2026-02-21T09:37:49.2840404Z %139 = tt.splat %138 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2840627Z %140 = arith.addi %139, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.2840909Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.2841188Z %142 = tt.broadcast %141 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2841383Z %143 = arith.addi %45, %142 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2841579Z %144 = tt.addptr %6, %143 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2841786Z %145 = tt.load %144 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2842053Z %146 = ttg.convert_layout %145 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2842462Z %147 = arith.extf %146 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2842804Z %148 = arith.muli %137, %c7168_i32 : i32 2026-02-21T09:37:49.2842978Z %149 = tt.splat %148 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2843205Z %150 = arith.addi %149, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2843512Z %151 = tt.addptr %7, %150 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2843817Z %152 = tt.load %151 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2844047Z %153 = arith.shli %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2844280Z %154 = arith.shrsi %153, 
%cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2844516Z %155 = arith.shrsi %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2844810Z %156 = tt.expand_dims %154 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2845143Z %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2845425Z %158 = tt.broadcast %156 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2845662Z %159 = arith.select %12, %158, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2845897Z %160 = tt.broadcast %157 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2846127Z %161 = arith.select %14, %160, %159 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2846386Z %162 = tt.reshape %161 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2846611Z %163 = arith.sitofp %162 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2846858Z %164 = ttg.local_alloc %163 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2847181Z %165 = ttg.local_load %164 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2847646Z %166 = tt.dot %147, %165, %136, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2847990Z scf.yield %166 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2848106Z } {tt.flatten} 2026-02-21T09:37:49.2848223Z %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1> 
2026-02-21T09:37:49.2848415Z %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.2848615Z %50 = tt.load %49 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.2848872Z %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2849268Z %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2849596Z %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2849905Z %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2850210Z %55 = tt.load %54 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2850432Z %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2850693Z %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2850918Z %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.2851197Z %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2851525Z %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.2851797Z %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2852026Z %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2852257Z %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 
2026-02-21T09:37:49.2852477Z %64 = arith.select %14, %63, %62 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.2852699Z %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.2852909Z %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.2853150Z %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.2853464Z %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.2853919Z %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.2854325Z %70 = arith.truncf %69 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:49.2854581Z %71 = tt.expand_dims %37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:49.2854815Z %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.2855046Z %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:49.2855296Z %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2855491Z %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2855661Z %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2855840Z %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:49.2856027Z tt.store %77, %70 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.2856162Z } {tt.num_stages = 1 : i32} 2026-02-21T09:37:49.2856265Z tt.return 2026-02-21T09:37:49.2856342Z } 
2026-02-21T09:37:49.2856414Z } 2026-02-21T09:37:49.2856456Z 2026-02-21T09:37:49.2856487Z {-# 2026-02-21T09:37:49.2856567Z external_resources: { 2026-02-21T09:37:49.2856665Z mlir_reproducer: { 2026-02-21T09:37:49.2857680Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:49.2858684Z disable_threading: false, 2026-02-21T09:37:49.2858786Z verify_each: true 2026-02-21T09:37:49.2858877Z } 2026-02-21T09:37:49.2858976Z } 2026-02-21T09:37:49.2859043Z #-} 2026-02-21T09:37:49.2859318Z /tmp/torchinductor_root/g6/cg6tqsj2ilamzoxlnt4q3ufxel6xqyijdituwathk7todtheq5a5.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:49.2860005Z /tmp/torchinductor_root/g6/cg6tqsj2ilamzoxlnt4q3ufxel6xqyijdituwathk7todtheq5a5.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:49.2860557Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:37:49.2861343Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:49.2862056Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:49.2862224Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:49.4136056Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:49.4139098Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:37:49.4140033Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:49.4141105Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:49.4141928Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:37:49.4142701Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:49.4143580Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:49.4144901Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : 
i32}) attributes {noinline = false} { 2026-02-21T09:37:49.4145952Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.4146249Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.4146471Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.4146687Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.4146886Z %cst_3 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4147067Z %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:49.4147217Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:49.4147340Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:49.4147463Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:49.4147583Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:49.4147705Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:49.4147828Z %c4092_i32 = arith.constant 4092 : i32 2026-02-21T09:37:49.4147947Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:37:49.4148093Z %cst_5 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4148247Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:37:49.4148362Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:49.4148472Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:49.4148699Z %cst_6 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4148890Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:49.4149087Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:49.4149359Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.4149628Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.4149889Z %4 = tt.make_range {end = 64 : 
i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.4150155Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4150417Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4150658Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.4150859Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.4151123Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:49.4151543Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:49.4151951Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.4152240Z %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.4152435Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:37:49.4152635Z %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.4152818Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:37:49.4153022Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.4153202Z scf.for %arg3 = %0 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:49.4153351Z %17 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:37:49.4153473Z %18 = arith.muli %17, %c4_i32 : i32 2026-02-21T09:37:49.4153590Z %19 = arith.subi %c4_i32, %18 : i32 2026-02-21T09:37:49.4153704Z %20 = arith.minsi %19, %c4_i32 : i32 2026-02-21T09:37:49.4153823Z %21 = arith.remsi 
%arg3, %c448_i32 : i32 2026-02-21T09:37:49.4153943Z %22 = arith.remsi %21, %20 : i32 2026-02-21T09:37:49.4154057Z %23 = arith.addi %18, %22 : i32 2026-02-21T09:37:49.4154185Z %24 = arith.divsi %21, %20 : i32 2026-02-21T09:37:49.4154299Z %25 = arith.muli %23, %c16_i32 : i32 2026-02-21T09:37:49.4154467Z %26 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:49.4154680Z %27 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.4154893Z %28 = arith.addi %26, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:49.4155103Z %29 = arith.addi %27, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.4155265Z %30 = arith.muli %24, %c64_i32 : i32 2026-02-21T09:37:49.4155426Z %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.4155633Z %32 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.4155843Z %33 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.4156050Z %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.4156348Z %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:37:49.4156597Z %36 = arith.muli %35, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:49.4156788Z %37 = tt.broadcast %36 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4157064Z %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:37:49.4157334Z %39 = tt.broadcast %38 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4157601Z %40 = scf.for %arg4 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg5 = %cst_2) -> 
(tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.4157870Z %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4158093Z %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4158269Z %52 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.4158433Z %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4158647Z %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4158917Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.4159189Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4159378Z %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4159576Z %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4159808Z %59 = tt.load %58 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.4160071Z %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4160480Z %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4160859Z %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4161102Z %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4161288Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4161475Z %65 = arith.addi %64, %39 : tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4161669Z %66 = tt.addptr %8, %65 : tensor<2x64x!tt.ptr, 
#blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4161864Z %67 = tt.load %66 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.4162105Z %68 = ttg.convert_layout %67 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4162384Z %69 = arith.shli %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4162673Z %70 = arith.shrsi %69, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4162906Z %71 = arith.shrsi %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4163185Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4163514Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4163794Z %74 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4164064Z %75 = arith.select %13, %74, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4164296Z %76 = tt.broadcast %73 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4164522Z %77 = arith.select %15, %76, %75 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4164744Z %78 = tt.reshape %77 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.4164963Z %79 = arith.sitofp %78 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.4165250Z %80 = ttg.convert_layout %79 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4165720Z %81 = tt.dot %61, %80, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, 
parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.4166070Z %82 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:49.4166241Z %83 = tt.splat %82 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4166459Z %84 = arith.addi %83, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4166626Z %85 = arith.muli %82, %c2_i32 : i32 2026-02-21T09:37:49.4166790Z %86 = tt.splat %85 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4167004Z %87 = arith.addi %86, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4167272Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.4167589Z %89 = tt.broadcast %88 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4167775Z %90 = arith.addi %37, %89 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4167973Z %91 = tt.addptr %7, %90 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4168175Z %92 = tt.load %91 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.4168435Z %93 = ttg.convert_layout %92 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4168833Z %94 = arith.extf %93 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4169211Z %95 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4169456Z %96 = arith.muli %95, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4169646Z %97 = tt.broadcast %96 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4169830Z %98 = arith.addi %97, %39 : tensor<2x64xi32, #blocked1> 
2026-02-21T09:37:49.4170021Z %99 = tt.addptr %8, %98 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4170216Z %100 = tt.load %99 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.4170458Z %101 = ttg.convert_layout %100 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4170742Z %102 = arith.shli %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4170976Z %103 = arith.shrsi %102, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4171212Z %104 = arith.shrsi %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4171505Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4171842Z %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4172158Z %107 = tt.broadcast %105 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4172393Z %108 = arith.select %13, %107, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4172628Z %109 = tt.broadcast %106 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4172859Z %110 = arith.select %15, %109, %108 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4173087Z %111 = tt.reshape %110 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.4173307Z %112 = arith.sitofp %111 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.4173601Z %113 = ttg.convert_layout %112 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4174065Z %114 = tt.dot %94, %113, %81, inputPrecision = tf32 : 
tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.4174406Z %115 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:37:49.4174579Z %116 = tt.splat %115 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4174803Z %117 = arith.addi %116, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4174975Z %118 = arith.muli %115, %c2_i32 : i32 2026-02-21T09:37:49.4175144Z %119 = tt.splat %118 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4175393Z %120 = arith.addi %119, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4175670Z %121 = tt.expand_dims %120 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.4175947Z %122 = tt.broadcast %121 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4176137Z %123 = arith.addi %37, %122 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4176335Z %124 = tt.addptr %7, %123 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4176538Z %125 = tt.load %124 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.4176806Z %126 = ttg.convert_layout %125 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4177211Z %127 = arith.extf %126 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4177594Z %128 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4177844Z %129 = arith.muli %128, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4178033Z %130 = tt.broadcast %129 : 
tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4178224Z %131 = arith.addi %130, %39 : tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4178419Z %132 = tt.addptr %8, %131 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4178615Z %133 = tt.load %132 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.4178856Z %134 = ttg.convert_layout %133 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4179134Z %135 = arith.shli %134, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4179370Z %136 = arith.shrsi %135, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4179637Z %137 = arith.shrsi %134, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4179925Z %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4180262Z %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4180543Z %140 = tt.broadcast %138 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4180783Z %141 = arith.select %13, %140, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4181021Z %142 = tt.broadcast %139 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4181252Z %143 = arith.select %15, %142, %141 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4181481Z %144 = tt.reshape %143 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.4181704Z %145 = arith.sitofp %144 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.4182002Z %146 = ttg.convert_layout %145 : tensor<4x64xf32, #blocked3> -> 
tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4182466Z %147 = tt.dot %127, %146, %114, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.4182811Z scf.yield %147 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.4182931Z } {tt.flatten} 2026-02-21T09:37:49.4183151Z %41 = scf.for %arg4 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %40) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.4183422Z %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4183645Z %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.4183816Z %52 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.4183981Z %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4184192Z %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:49.4184462Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:37:49.4184733Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4184919Z %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4185115Z %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:37:49.4185309Z %59 = tt.load %58 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:37:49.4185571Z %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4185966Z %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4186344Z %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4186586Z %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:37:49.4186772Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4186958Z %65 = arith.addi %64, %39 : tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4187148Z %66 = tt.addptr %8, %65 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:37:49.4187367Z %67 = tt.load %66 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:37:49.4187601Z %68 = ttg.convert_layout %67 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4187874Z %69 = arith.shli %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4188103Z %70 = arith.shrsi %69, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4188331Z %71 = arith.shrsi %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.4188612Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4188944Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:37:49.4189219Z %74 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4189454Z %75 = arith.select %13, %74, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4189686Z %76 = tt.broadcast %73 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4189908Z %77 = arith.select %15, %76, %75 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:37:49.4190129Z 
%78 = tt.reshape %77 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:37:49.4190341Z %79 = arith.sitofp %78 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:37:49.4190626Z %80 = ttg.convert_layout %79 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.4191131Z %81 = tt.dot %61, %80, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.4191475Z scf.yield %81 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.4191603Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:37:49.4191765Z %42 = arith.truncf %41 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:49.4192022Z %43 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:49.4192253Z %44 = arith.muli %43, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.4192472Z %45 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:49.4192720Z %46 = tt.broadcast %44 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.4192917Z %47 = tt.broadcast %45 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.4193085Z %48 = arith.addi %46, %47 : tensor<16x64xi32, #mma> 2026-02-21T09:37:49.4193267Z %49 = tt.addptr %16, %48 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:49.4193451Z tt.store %49, %42 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.4193608Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:37:49.4193741Z tt.return 2026-02-21T09:37:49.4193818Z } 2026-02-21T09:37:49.4193886Z } 2026-02-21T09:37:49.4193929Z 2026-02-21T09:37:49.4193958Z {-# 2026-02-21T09:37:49.4194037Z 
external_resources: { 2026-02-21T09:37:49.4194134Z mlir_reproducer: { 2026-02-21T09:37:49.4195136Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:49.4196167Z disable_threading: false, 2026-02-21T09:37:49.4196268Z verify_each: true 2026-02-21T09:37:49.4196357Z } 2026-02-21T09:37:49.4196426Z } 2026-02-21T09:37:49.4196491Z #-} 2026-02-21T09:37:49.4196767Z /tmp/torchinductor_root/fk/cfkxb2bdtdprrt3gfdetbnk47ke6tjz4fd3xdafarz2lhml4efpd.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:49.4197452Z /tmp/torchinductor_root/fk/cfkxb2bdtdprrt3gfdetbnk47ke6tjz4fd3xdafarz2lhml4efpd.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:49.4198009Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:37:49.4198801Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:49.4199509Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:49.4199673Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:49.7338742Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:49.7345881Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:37:49.7346446Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:49.7346966Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:49.7347432Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:49.7347862Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:49.7348173Z #smem = #ttg.shared_memory 2026-02-21T09:37:49.7348543Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:49.7349322Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:49.7349954Z %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7350210Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:49.7350404Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:49.7350589Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:49.7350766Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:49.7350941Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:37:49.7351179Z %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7351418Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:49.7351613Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:49.7351795Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:49.7351978Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:37:49.7352152Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:49.7352327Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:49.7352602Z %c4095_i32 = arith.constant 4095 : i32 2026-02-21T09:37:49.7352784Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:49.7352960Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T09:37:49.7353276Z %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7353712Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7354019Z %c2879_i32 = arith.constant 2879 : i32 2026-02-21T09:37:49.7354252Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.7354567Z %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7354866Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.7355103Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.7355332Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.7355535Z %0 
= tt.get_program_id x : i32 2026-02-21T09:37:49.7355807Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.7356182Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.7356612Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.7357036Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.7357407Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7357789Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7358115Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7358542Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:49.7359117Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:49.7359678Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.7360029Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.7360297Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:49.7360570Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:49.7360828Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:49.7361118Z %15 = tt.splat %arg2 : !tt.ptr -> 
tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.7361341Z %16 = arith.subi %c2879_i32, %0 : i32 2026-02-21T09:37:49.7361505Z %17 = arith.divui %16, %c2432_i32 : i32 2026-02-21T09:37:49.7361664Z %18 = arith.remsi %17, %c2_i32 : i32 2026-02-21T09:37:49.7361823Z %19 = arith.subi %17, %18 : i32 2026-02-21T09:37:49.7362015Z %20 = arith.muli %19, %c2432_i32 : i32 2026-02-21T09:37:49.7362176Z %21 = arith.addi %0, %20 : i32 2026-02-21T09:37:49.7362400Z %22 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7362865Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7363245Z %24 = tt.broadcast %23 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7363503Z scf.for %arg3 = %0 to %21 step %c4864_i32 : i32 { 2026-02-21T09:37:49.7363752Z %25 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:37:49.7363917Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:37:49.7364075Z %27 = arith.subi %c4_i32, %26 : i32 2026-02-21T09:37:49.7364230Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:37:49.7364392Z %29 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:37:49.7364554Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:37:49.7364707Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:37:49.7364859Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:37:49.7365010Z %33 = arith.muli %31, %c16_i32 : i32 2026-02-21T09:37:49.7365209Z %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.7365441Z %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.7365669Z %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.7365900Z %37 = arith.addi %35, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.7366076Z %38 = arith.muli %32, %c64_i32 : i32 
2026-02-21T09:37:49.7366294Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.7366561Z %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.7366830Z %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.7367096Z %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.7367383Z %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.7367695Z %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.7367899Z %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7368289Z %46 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7368715Z %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.7368949Z %132 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.7369141Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7369387Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7369690Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7369999Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7370214Z %137 = arith.addi %45, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7370434Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7370659Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7370954Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7371397Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7371710Z %142 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:49.7371908Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7372160Z %144 = arith.addi %143, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7372536Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7372871Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7373126Z %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7373385Z %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7373643Z %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7373964Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7374330Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7374619Z %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7374859Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, 
#blocked> 2026-02-21T09:37:49.7375094Z %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7375325Z %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7375554Z %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7375774Z %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7376170Z %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7376493Z %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7376972Z %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7377321Z %161 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:49.7377442Z %162 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:37:49.7377611Z %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7377833Z %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7378110Z %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7378387Z %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7378581Z %167 = arith.addi %45, %166 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7378777Z %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7378981Z %169 = tt.load %168 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7379246Z %170 = 
ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7379648Z %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7379929Z %172 = arith.muli %161, %c7168_i32 : i32 2026-02-21T09:37:49.7380106Z %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7380330Z %174 = arith.addi %173, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7380680Z %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7380987Z %176 = tt.load %175 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7381217Z %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7381450Z %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7381684Z %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7381975Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7382310Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7382595Z %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7382833Z %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7383068Z %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 
2026-02-21T09:37:49.7383297Z %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7383525Z %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7383744Z %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7384019Z %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7384343Z %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7384814Z %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7385157Z %191 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:49.7385278Z %192 = arith.muli %191, %c2_i32 : i32 2026-02-21T09:37:49.7385449Z %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7385676Z %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7385957Z %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7386234Z %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7386429Z %197 = arith.addi %45, %196 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7386628Z %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7386833Z %199 = tt.load %198 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7387097Z %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:37:49.7387500Z %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7387782Z %202 = arith.muli %191, %c7168_i32 : i32 2026-02-21T09:37:49.7387960Z %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7388188Z %204 = arith.addi %203, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7388527Z %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7388834Z %206 = tt.load %205 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7389063Z %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7389299Z %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7389534Z %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7389824Z %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7390157Z %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7390443Z %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7390680Z %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7390916Z %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7391147Z %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7391377Z 
%216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7391597Z %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7391880Z %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7392206Z %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7392668Z %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7393012Z scf.yield %220 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7393129Z } {tt.flatten} 2026-02-21T09:37:49.7393244Z %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7393438Z %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7393635Z %50 = tt.load %49 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7393898Z %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7394295Z %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7394624Z %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7394930Z %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7395227Z %55 = tt.load %54 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7395449Z %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7395676Z %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7395904Z %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7396217Z %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7396542Z %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7396817Z %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7397046Z %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7397274Z %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7397497Z %64 = arith.select %14, %63, %62 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7397715Z %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7397929Z %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7398176Z %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7398492Z %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7398951Z %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7399327Z %70 = arith.truncf %69 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:49.7399586Z %71 = tt.expand_dims 
%37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:49.7399851Z %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.7400074Z %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:49.7400323Z %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7400514Z %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7400686Z %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7400862Z %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7401047Z tt.store %77, %70 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.7401186Z %78 = arith.addi %arg3, %c2432_i32 : i32 2026-02-21T09:37:49.7401307Z %79 = arith.divsi %78, %c896_i32 : i32 2026-02-21T09:37:49.7401422Z %80 = arith.muli %79, %c8_i32 : i32 2026-02-21T09:37:49.7401536Z %81 = arith.subi %c4_i32, %80 : i32 2026-02-21T09:37:49.7401653Z %82 = arith.minsi %81, %c8_i32 : i32 2026-02-21T09:37:49.7401769Z %83 = arith.remsi %78, %c896_i32 : i32 2026-02-21T09:37:49.7401883Z %84 = arith.remsi %83, %82 : i32 2026-02-21T09:37:49.7401997Z %85 = arith.addi %80, %84 : i32 2026-02-21T09:37:49.7402105Z %86 = arith.divsi %83, %82 : i32 2026-02-21T09:37:49.7402214Z %87 = arith.muli %85, %c16_i32 : i32 2026-02-21T09:37:49.7402377Z %88 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.7402622Z %89 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.7402832Z %90 = arith.addi %88, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.7403037Z %91 = arith.addi %89, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.7403198Z %92 = arith.muli %86, %c64_i32 : i32 
2026-02-21T09:37:49.7403399Z %93 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.7403647Z %94 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.7403935Z %95 = arith.addi %93, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.7404180Z %96 = arith.addi %94, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.7404444Z %97 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.7404690Z %98 = arith.muli %97, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.7404878Z %99 = tt.broadcast %98 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7405227Z %100 = tt.expand_dims %95 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7405618Z %101 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.7405839Z %132 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.7406014Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7406238Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7406514Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7406789Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7406984Z %137 = arith.addi %99, %136 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7407230Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7407436Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7407706Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7408107Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7408393Z %142 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:49.7408569Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7408798Z %144 = arith.addi %143, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7409112Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7409422Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7409656Z %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7409889Z %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7410125Z %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7410415Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7410748Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7411035Z %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7411275Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, 
#blocked> 2026-02-21T09:37:49.7411551Z %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7411784Z %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7412013Z %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7412236Z %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7412488Z %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7412815Z %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7413286Z %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7413639Z %161 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:49.7413762Z %162 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:37:49.7413934Z %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7414160Z %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7414438Z %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7414713Z %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7414909Z %167 = arith.addi %99, %166 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7415139Z %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7415346Z %169 = tt.load %168 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7423307Z %170 = 
ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7423727Z %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7424016Z %172 = arith.muli %161, %c7168_i32 : i32 2026-02-21T09:37:49.7424195Z %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7424423Z %174 = arith.addi %173, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7424737Z %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7425043Z %176 = tt.load %175 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7425278Z %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7425513Z %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7425748Z %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7426036Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7426370Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7426653Z %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7426893Z %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7427183Z %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 
2026-02-21T09:37:49.7427414Z %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7427639Z %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7427859Z %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7428106Z %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7428423Z %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7428891Z %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7429240Z %191 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:49.7429363Z %192 = arith.muli %191, %c2_i32 : i32 2026-02-21T09:37:49.7429534Z %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7429754Z %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7430028Z %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7430302Z %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7430493Z %197 = arith.addi %99, %196 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7430726Z %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7430929Z %199 = tt.load %198 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7431196Z %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:37:49.7431594Z %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7431875Z %202 = arith.muli %191, %c7168_i32 : i32 2026-02-21T09:37:49.7432048Z %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7432274Z %204 = arith.addi %203, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7432583Z %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7432887Z %206 = tt.load %205 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7433116Z %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7433349Z %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7433580Z %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7433864Z %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7434195Z %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7434472Z %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7434707Z %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7434971Z %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7435198Z %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7435423Z 
%216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7435640Z %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7435885Z %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7436199Z %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7436670Z %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7437018Z scf.yield %220 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7437136Z } {tt.flatten} 2026-02-21T09:37:49.7437254Z %102 = arith.addi %99, %24 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7437449Z %103 = tt.addptr %6, %102 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7437651Z %104 = tt.load %103 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7437913Z %105 = ttg.convert_layout %104 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7438303Z %106 = arith.extf %105 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7438664Z %107 = arith.addi %100, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7438980Z %108 = tt.addptr %7, %107 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7439282Z %109 = tt.load %108 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7439509Z %110 = arith.shli %109, %cst_4 : 
tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7439737Z %111 = arith.shrsi %110, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7439967Z %112 = arith.shrsi %109, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7440248Z %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7440578Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7440856Z %115 = tt.broadcast %113 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7441090Z %116 = arith.select %12, %115, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7441322Z %117 = tt.broadcast %114 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7441549Z %118 = arith.select %14, %117, %116 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7441772Z %119 = tt.reshape %118 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7441989Z %120 = arith.sitofp %119 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7442233Z %121 = ttg.local_alloc %120 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7442548Z %122 = ttg.local_load %121 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7443077Z %123 = tt.dot %106, %122, %101, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7443454Z %124 = arith.truncf %123 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 
2026-02-21T09:37:49.7443714Z %125 = tt.expand_dims %91 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:49.7443948Z %126 = arith.muli %125, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.7444175Z %127 = tt.expand_dims %96 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:49.7444430Z %128 = tt.broadcast %126 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7444627Z %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7444807Z %130 = arith.addi %128, %129 : tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7444990Z %131 = tt.addptr %15, %130 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7445181Z tt.store %131, %124 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.7445317Z } {tt.num_stages = 1 : i32} 2026-02-21T09:37:49.7445446Z scf.for %arg3 = %21 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:49.7445591Z %25 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:37:49.7445710Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:37:49.7445825Z %27 = arith.subi %c4_i32, %26 : i32 2026-02-21T09:37:49.7445937Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:37:49.7446055Z %29 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:37:49.7446170Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:37:49.7446329Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:37:49.7446438Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:37:49.7446550Z %33 = arith.muli %31, %c16_i32 : i32 2026-02-21T09:37:49.7446714Z %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.7446921Z %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.7447127Z %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:49.7447332Z %37 = arith.addi %35, %2 
: tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:49.7447490Z %38 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:37:49.7447690Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.7447935Z %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.7448180Z %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:49.7448424Z %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:49.7448686Z %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.7448930Z %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:49.7449117Z %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7449465Z %46 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7449854Z %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:49.7450066Z %78 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:49.7450233Z %79 = tt.splat %78 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7450484Z %80 = arith.addi %79, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7450767Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7451034Z %82 = tt.broadcast %81 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7451221Z %83 = arith.addi %45, %82 : 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7451413Z %84 = tt.addptr %6, %83 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7451611Z %85 = tt.load %84 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7451870Z %86 = ttg.convert_layout %85 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7452263Z %87 = arith.extf %86 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7452548Z %88 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:49.7452718Z %89 = tt.splat %88 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7452937Z %90 = arith.addi %89, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7453235Z %91 = tt.addptr %7, %90 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7453533Z %92 = tt.load %91 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7453754Z %93 = arith.shli %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7454013Z %94 = arith.shrsi %93, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7454246Z %95 = arith.shrsi %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7454523Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7454848Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7455121Z %98 = tt.broadcast %96 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7455347Z 
%99 = arith.select %12, %98, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7455575Z %100 = tt.broadcast %97 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7455803Z %101 = arith.select %14, %100, %99 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7456027Z %102 = tt.reshape %101 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7456249Z %103 = arith.sitofp %102 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7456494Z %104 = ttg.local_alloc %103 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7456813Z %105 = ttg.local_load %104 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7457276Z %106 = tt.dot %87, %105, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7457615Z %107 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:49.7457739Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:37:49.7457905Z %109 = tt.splat %108 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7458158Z %110 = arith.addi %109, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7458432Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7458704Z %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7458895Z %113 = arith.addi %45, %112 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7459091Z %114 = tt.addptr %6, %113 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7459293Z %115 = 
tt.load %114 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7459559Z %116 = ttg.convert_layout %115 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7459955Z %117 = arith.extf %116 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7460238Z %118 = arith.muli %107, %c7168_i32 : i32 2026-02-21T09:37:49.7460409Z %119 = tt.splat %118 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7460634Z %120 = arith.addi %119, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7460940Z %121 = tt.addptr %7, %120 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7461247Z %122 = tt.load %121 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7461477Z %123 = arith.shli %122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7461741Z %124 = arith.shrsi %123, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7461978Z %125 = arith.shrsi %122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7462263Z %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7462594Z %127 = tt.expand_dims %125 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7462876Z %128 = tt.broadcast %126 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7463109Z %129 = arith.select %12, %128, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7463341Z %130 = tt.broadcast %127 : 
tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7463573Z %131 = arith.select %14, %130, %129 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7463797Z %132 = tt.reshape %131 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7464017Z %133 = arith.sitofp %132 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7464261Z %134 = ttg.local_alloc %133 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7464579Z %135 = ttg.local_load %134 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7465039Z %136 = tt.dot %117, %135, %106, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7465377Z %137 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:49.7465500Z %138 = arith.muli %137, %c2_i32 : i32 2026-02-21T09:37:49.7465667Z %139 = tt.splat %138 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7465918Z %140 = arith.addi %139, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:49.7466189Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:49.7466460Z %142 = tt.broadcast %141 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7466651Z %143 = arith.addi %45, %142 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7466846Z %144 = tt.addptr %6, %143 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7467046Z %145 = tt.load %144 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7467311Z %146 = ttg.convert_layout %145 : tensor<16x2xbf16, #blocked1> -> 
tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7467705Z %147 = arith.extf %146 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7467991Z %148 = arith.muli %137, %c7168_i32 : i32 2026-02-21T09:37:49.7468162Z %149 = tt.splat %148 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7468385Z %150 = arith.addi %149, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7468692Z %151 = tt.addptr %7, %150 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7468999Z %152 = tt.load %151 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7469228Z %153 = arith.shli %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7469488Z %154 = arith.shrsi %153, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7469725Z %155 = arith.shrsi %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7470008Z %156 = tt.expand_dims %154 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7470337Z %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7470615Z %158 = tt.broadcast %156 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7470849Z %159 = arith.select %12, %158, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7471080Z %160 = tt.broadcast %157 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7471309Z %161 = arith.select %14, %160, %159 : tensor<1x2x64xi1, 
#blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7471536Z %162 = tt.reshape %161 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7471754Z %163 = arith.sitofp %162 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7472003Z %164 = ttg.local_alloc %163 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7472323Z %165 = ttg.local_load %164 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7472785Z %166 = tt.dot %147, %165, %136, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7473126Z scf.yield %166 : tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7473243Z } {tt.flatten} 2026-02-21T09:37:49.7473358Z %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7473584Z %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:49.7473781Z %50 = tt.load %49 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:49.7474037Z %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7474428Z %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7474752Z %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7475057Z %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7475356Z %55 = tt.load %54 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:37:49.7475581Z %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7475807Z %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7476030Z %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:49.7476305Z %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7476631Z %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:49.7476902Z %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7477158Z %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7477382Z %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7477607Z %64 = arith.select %14, %63, %62 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:49.7477821Z %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:49.7478030Z %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:49.7478269Z %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:49.7478580Z %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:49.7479033Z %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:49.7479404Z %70 = arith.truncf %69 : tensor<16x64xf32, #mma> 
to tensor<16x64xbf16, #mma> 2026-02-21T09:37:49.7479660Z %71 = tt.expand_dims %37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:49.7479889Z %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:49.7480111Z %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:49.7480355Z %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7480549Z %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7480719Z %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7480893Z %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:49.7481082Z tt.store %77, %70 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:49.7481213Z } {tt.num_stages = 1 : i32} 2026-02-21T09:37:49.7481346Z tt.return 2026-02-21T09:37:49.7481422Z } 2026-02-21T09:37:49.7481493Z } 2026-02-21T09:37:49.7481535Z 2026-02-21T09:37:49.7481564Z {-# 2026-02-21T09:37:49.7481642Z external_resources: { 2026-02-21T09:37:49.7481739Z mlir_reproducer: { 2026-02-21T09:37:49.7482767Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:49.7483755Z disable_threading: false, 2026-02-21T09:37:49.7483863Z verify_each: true 2026-02-21T09:37:49.7483948Z } 
2026-02-21T09:37:49.7484018Z }
2026-02-21T09:37:49.7484084Z #-}
2026-02-21T09:37:49.7484358Z /tmp/torchinductor_root/vo/cvoepzha374qpppib5bjjzr2722y5wpm227pikwbnla374as2rrd.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:49.7485035Z /tmp/torchinductor_root/vo/cvoepzha374qpppib5bjjzr2722y5wpm227pikwbnla374as2rrd.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:49.7485578Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:49.7486389Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:49.7487097Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:49.7487261Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:50.0111373Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:50.0114878Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:37:50.0115680Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:50.0116432Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:50.0117125Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:50.0117715Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:50.0118137Z #smem = #ttg.shared_memory 2026-02-21T09:37:50.0118679Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:50.0119784Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:50.0120683Z %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.0121050Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:50.0121331Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:50.0121848Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:50.0122103Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:50.0122431Z %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0122884Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:50.0123154Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:50.0123409Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:50.0123665Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:50.0123920Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:50.0124191Z %c4095_i32 = arith.constant 4095 : i32 
2026-02-21T09:37:50.0124408Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:50.0124760Z %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0125198Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0125558Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:50.0125915Z %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0126263Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.0126542Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.0126814Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.0127044Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:50.0127368Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.0127803Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.0128377Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:50.0128878Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.0129310Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0129700Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.0130087Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0130586Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 
2026-02-21T09:37:50.0131267Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:50.0131922Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.0132339Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.0132652Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:50.0132968Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.0133273Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:50.0133601Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.0133946Z %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0134392Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.0134767Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0135003Z scf.for %arg3 = %0 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:50.0135229Z %19 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:37:50.0135380Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:50.0135522Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:50.0135663Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:50.0135829Z %23 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:37:50.0135972Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:50.0136107Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:50.0136239Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:50.0136374Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:50.0136579Z %28 = tt.splat %27 : i32 -> 
tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.0136841Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.0137098Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.0137358Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.0137555Z %32 = arith.muli %26, %c64_i32 : i32 2026-02-21T09:37:50.0137804Z %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:50.0138107Z %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.0138411Z %35 = arith.addi %33, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:50.0138713Z %36 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.0139083Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:50.0139391Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:50.0139628Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0140060Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0140541Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:50.0140807Z %72 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:50.0141015Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0141282Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:50.0141620Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.0141955Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0142192Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0142435Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0142681Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.0143005Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0143503Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0143855Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:50.0144073Z %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0144344Z %84 = arith.addi %83, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0144763Z %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0145130Z %86 = tt.load %85 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0145359Z %87 = arith.shli %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0145590Z %88 = arith.shrsi %87, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0145818Z %89 = arith.shrsi %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0146101Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0146426Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0146704Z %92 = tt.broadcast %90 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0146935Z %93 = arith.select %12, %92, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0147162Z %94 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0147383Z %95 = arith.select %14, %94, %93 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0147603Z %96 = tt.reshape %95 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.0147817Z %97 = arith.sitofp %96 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.0148098Z %98 = ttg.local_alloc %97 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.0148413Z %99 = ttg.local_load %98 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0148881Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.0149223Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:50.0149346Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:50.0149516Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0149737Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0150014Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.0150289Z %106 = 
tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0150485Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0150682Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0150884Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.0151149Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0151548Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0151827Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:50.0152004Z %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0152228Z %114 = arith.addi %113, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0152584Z %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0152890Z %116 = tt.load %115 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0153118Z %117 = arith.shli %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0153351Z %118 = arith.shrsi %117, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0153584Z %119 = arith.shrsi %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0153873Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0154206Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0154492Z %122 = tt.broadcast %120 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0154728Z %123 = arith.select %12, %122, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0154959Z %124 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0155186Z %125 = arith.select %14, %124, %123 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0155411Z %126 = tt.reshape %125 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.0155626Z %127 = arith.sitofp %126 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.0155904Z %128 = ttg.local_alloc %127 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.0156219Z %129 = ttg.local_load %128 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0156680Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.0157018Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:50.0157137Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:50.0157303Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0157521Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.0157792Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.0158065Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0158256Z %137 = arith.addi %39, %136 : 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0158453Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0158652Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.0158913Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0159305Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0159579Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:50.0159751Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0159970Z %144 = arith.addi %143, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0160308Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0160609Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0160835Z %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0161065Z %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0161296Z %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0161581Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0161911Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0162191Z %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 
2026-02-21T09:37:50.0162422Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0162694Z %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0162920Z %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0163146Z %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.0163364Z %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.0163660Z %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.0163975Z %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0164434Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.0164774Z scf.yield %160 : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.0164889Z } {tt.flatten} 2026-02-21T09:37:50.0165004Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0165193Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.0165386Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.0165645Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0166028Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.0166354Z %47 = arith.addi %40, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:37:50.0166657Z %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0166952Z %49 = tt.load %48 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0167170Z %50 = arith.shli %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0167391Z %51 = arith.shrsi %50, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0167618Z %52 = arith.shrsi %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.0167896Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0168252Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.0168520Z %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0168744Z %56 = arith.select %12, %55, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0168968Z %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0169187Z %58 = arith.select %14, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.0169401Z %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.0169611Z %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.0169850Z %61 = ttg.local_alloc %60 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.0170158Z %62 = ttg.local_load %61 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:37:50.0170605Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.0170970Z %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:50.0171223Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:50.0171485Z %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.0171708Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:50.0171959Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.0172148Z %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.0172317Z %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma> 2026-02-21T09:37:50.0172491Z %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:50.0172677Z tt.store %71, %64 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.0172835Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:37:50.0172969Z tt.return 2026-02-21T09:37:50.0173047Z } 2026-02-21T09:37:50.0173117Z } 2026-02-21T09:37:50.0173159Z 2026-02-21T09:37:50.0173187Z {-# 2026-02-21T09:37:50.0173264Z external_resources: { 2026-02-21T09:37:50.0173363Z mlir_reproducer: { 2026-02-21T09:37:50.0174354Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, 
convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:50.0175339Z disable_threading: false, 2026-02-21T09:37:50.0175440Z verify_each: true 2026-02-21T09:37:50.0175528Z } 2026-02-21T09:37:50.0175595Z } 2026-02-21T09:37:50.0175661Z #-} 2026-02-21T09:37:50.0175933Z /tmp/torchinductor_root/7s/c7sb5sb6s3beeqyxvqjiy2qgqbn2owvyq5rq233iflctmhwzatng.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:50.0176600Z /tmp/torchinductor_root/7s/c7sb5sb6s3beeqyxvqjiy2qgqbn2owvyq5rq233iflctmhwzatng.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:50.0177176Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:50.0177942Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:50.0178643Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:50.0178805Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:50.1414156Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:37:50.1418851Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:37:50.1419274Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:50.1419650Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:50.1420004Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:37:50.1420430Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:50.1420657Z #smem = #ttg.shared_memory 2026-02-21T09:37:50.1420950Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:50.1421535Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:50.1422019Z %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.1422218Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:50.1422368Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:37:50.1422517Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:50.1422654Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:37:50.1422832Z %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1423018Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:37:50.1423166Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:50.1423309Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:50.1423446Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:50.1423587Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:50.1423730Z %c4095_i32 = arith.constant 4095 : i32 
2026-02-21T09:37:50.1423876Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:50.1424116Z %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1424440Z %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1424702Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:50.1424965Z %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1425223Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.1425421Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.1425668Z %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.1425834Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:50.1426064Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.1426382Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.1426735Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:50.1427090Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.1427399Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1427682Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.1427961Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1428316Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 
2026-02-21T09:37:50.1428797Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:50.1429261Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.1429555Z %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.1429819Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:50.1430044Z %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.1430267Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked> 2026-02-21T09:37:50.1430505Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.1430752Z %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1431075Z %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.1431385Z %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1431608Z scf.for %arg3 = %0 to %c448_i32 step %c2432_i32 : i32 { 2026-02-21T09:37:50.1431801Z %19 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:37:50.1431944Z %20 = arith.muli %19, %c4_i32 : i32 2026-02-21T09:37:50.1432085Z %21 = arith.subi %c4_i32, %20 : i32 2026-02-21T09:37:50.1432132Z %22 = arith.minsi %21, %c4_i32 : i32 2026-02-21T09:37:50.1432187Z %23 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:37:50.1432236Z %24 = arith.remsi %23, %22 : i32 2026-02-21T09:37:50.1432282Z %25 = arith.addi %20, %24 : i32 2026-02-21T09:37:50.1432327Z %26 = arith.divsi %23, %22 : i32 2026-02-21T09:37:50.1432374Z %27 = arith.muli %25, %c16_i32 : i32 2026-02-21T09:37:50.1432483Z %28 = tt.splat %27 : i32 -> 
tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.1432577Z %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.1432684Z %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.1432778Z %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.1432824Z %32 = arith.muli %26, %c64_i32 : i32 2026-02-21T09:37:50.1432977Z %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:50.1433116Z %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.1433263Z %35 = arith.addi %33, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:37:50.1433359Z %36 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.1433526Z %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:37:50.1433599Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1> 2026-02-21T09:37:50.1433707Z %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1433966Z %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1434121Z %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:50.1434178Z %72 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:50.1434280Z %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1434367Z %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:37:50.1434507Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.1434595Z %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1434654Z %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1434785Z %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1434847Z %79 = tt.load %78 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.1435011Z %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1435210Z %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1435256Z %82 = arith.muli %arg4, %c7168_i32 : i32 2026-02-21T09:37:50.1435346Z %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1435437Z %84 = arith.addi %83, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1435608Z %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1435702Z %86 = tt.load %85 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1435801Z %87 = arith.shli %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1435896Z %88 = arith.shrsi %87, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1435988Z %89 = arith.shrsi %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1436132Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1436274Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1436364Z %92 = tt.broadcast %90 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1436464Z %93 = arith.select %12, %92, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1436552Z %94 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1436676Z %95 = arith.select %14, %94, %93 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1436760Z %96 = tt.reshape %95 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.1436846Z %97 = arith.sitofp %96 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.1436958Z %98 = ttg.local_alloc %97 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.1437121Z %99 = ttg.local_load %98 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1437387Z %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.1437435Z %101 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:37:50.1437478Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:37:50.1437570Z %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1437663Z %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1437808Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.1437899Z %106 = 
tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1437962Z %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1438061Z %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1438153Z %109 = tt.load %108 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.1438318Z %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1438514Z %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1438562Z %112 = arith.muli %101, %c7168_i32 : i32 2026-02-21T09:37:50.1438655Z %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1438747Z %114 = arith.addi %113, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1438922Z %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1439018Z %116 = tt.load %115 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1439116Z %117 = arith.shli %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1439215Z %118 = arith.shrsi %117, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1439310Z %119 = arith.shrsi %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1439456Z %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1439601Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1439694Z %122 = tt.broadcast %120 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1439798Z %123 = arith.select %12, %122, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1439892Z %124 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1440024Z %125 = arith.select %14, %124, %123 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1440111Z %126 = tt.reshape %125 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.1440204Z %127 = arith.sitofp %126 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.1440320Z %128 = ttg.local_alloc %127 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.1440485Z %129 = ttg.local_load %128 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1440747Z %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.1440793Z %131 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:37:50.1440837Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:37:50.1440930Z %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1441021Z %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.1441164Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:37:50.1441255Z %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1441317Z %137 = arith.addi %39, %136 : 
tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1441443Z %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1441507Z %139 = tt.load %138 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.1441672Z %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1441864Z %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1441909Z %142 = arith.muli %131, %c7168_i32 : i32 2026-02-21T09:37:50.1442002Z %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1442092Z %144 = arith.addi %143, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1442266Z %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1442360Z %146 = tt.load %145 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1442456Z %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1442554Z %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1442688Z %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1442833Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1442980Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1443071Z %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 
2026-02-21T09:37:50.1443172Z %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1443260Z %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1443401Z %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1443488Z %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.1443578Z %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.1443696Z %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.1443860Z %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1444122Z %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.1444173Z scf.yield %160 : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.1444218Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:37:50.1444278Z %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1444378Z %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr, #blocked1>, tensor<16x2xi32, #blocked1> 2026-02-21T09:37:50.1444438Z %44 = tt.load %43 : tensor<16x2x!tt.ptr, #blocked1> 2026-02-21T09:37:50.1444600Z %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1444798Z %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.1444924Z %47 = arith.addi %40, %cst_1 : 
tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1445093Z %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1446668Z %49 = tt.load %48 : tensor<1x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1446763Z %50 = arith.shli %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1446857Z %51 = arith.shrsi %50, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1446948Z %52 = arith.shrsi %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.1447093Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1447237Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked> 2026-02-21T09:37:50.1447327Z %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1447428Z %56 = arith.select %12, %55, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1447527Z %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1447621Z %58 = arith.select %14, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked> 2026-02-21T09:37:50.1447708Z %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2> 2026-02-21T09:37:50.1447794Z %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2> 2026-02-21T09:37:50.1447905Z %61 = ttg.local_alloc %60 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem> 2026-02-21T09:37:50.1448068Z %62 = ttg.local_load %61 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:37:50.1448325Z %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.1448429Z %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:50.1448563Z %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:50.1448620Z %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.1448749Z %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:50.1448825Z %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.1448903Z %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.1448958Z %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma> 2026-02-21T09:37:50.1449048Z %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:50.1449106Z tt.store %71, %64 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.1449172Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:37:50.1449207Z tt.return 2026-02-21T09:37:50.1449238Z } 2026-02-21T09:37:50.1449269Z } 2026-02-21T09:37:50.1449273Z 2026-02-21T09:37:50.1449301Z {-# 2026-02-21T09:37:50.1449341Z external_resources: { 2026-02-21T09:37:50.1449377Z mlir_reproducer: { 2026-02-21T09:37:50.1450349Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, 
convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:50.1450456Z disable_threading: false, 2026-02-21T09:37:50.1450491Z verify_each: true 2026-02-21T09:37:50.1450520Z } 2026-02-21T09:37:50.1450549Z } 2026-02-21T09:37:50.1450577Z #-} 2026-02-21T09:37:50.1450813Z /tmp/torchinductor_root/qr/cqrrueayeq7jpuik2iuudslnn32l6itvmutzhyd7ctjkz356xuau.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:50.1451224Z /tmp/torchinductor_root/qr/cqrrueayeq7jpuik2iuudslnn32l6itvmutzhyd7ctjkz356xuau.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:50.1451334Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:37:50.1451965Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:50.1452020Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:50.1452099Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:37:50.4682119Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:37:50.4687468Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:37:50.4687931Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:50.4688235Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:50.4688530Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:37:50.4688807Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:37:50.4689057Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:37:50.4689286Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:37:50.4689467Z #smem = #ttg.shared_memory 2026-02-21T09:37:50.4689695Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:37:50.4690162Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:37:50.4690628Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4690801Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.4690969Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.4691148Z %cst_2 = arith.constant dense<0.000000e+00> : 
tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4691305Z %c5311_i32 = arith.constant 5311 : i32 2026-02-21T09:37:50.4691456Z %cst_3 = arith.constant dense<7168> : tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4691697Z %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4691845Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:37:50.4691963Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:37:50.4692075Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:37:50.4692231Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:37:50.4692345Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:37:50.4692458Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:37:50.4692571Z %c14592_i32 = arith.constant 14592 : i32 2026-02-21T09:37:50.4692692Z %c9728_i32 = arith.constant 9728 : i32 2026-02-21T09:37:50.4692833Z %cst_5 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4692976Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:37:50.4693084Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:37:50.4693193Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:37:50.4693307Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:37:50.4693418Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T09:37:50.4693602Z %cst_6 = arith.constant dense<4> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4693807Z %0 = tt.get_program_id x : i32 2026-02-21T09:37:50.4693999Z %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4694273Z %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4694535Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4694792Z %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4695048Z %5 = 
tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4695312Z %6 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4695546Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:37:50.4718019Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<4x64x!tt.ptr, #blocked1> 2026-02-21T09:37:50.4718290Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:37:50.4718702Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:37:50.4719099Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.4719347Z %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.4719547Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked> 2026-02-21T09:37:50.4719738Z %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:37:50.4719923Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked> 2026-02-21T09:37:50.4720127Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.4720284Z %17 = arith.subi %c5311_i32, %0 : i32 2026-02-21T09:37:50.4720405Z %18 = arith.divui %17, %c4864_i32 : i32 2026-02-21T09:37:50.4720519Z %19 = arith.remsi %18, %c3_i32 : i32 2026-02-21T09:37:50.4720631Z %20 = arith.subi %18, %19 : i32 2026-02-21T09:37:50.4720740Z %21 = arith.muli %20, %c4864_i32 : i32 2026-02-21T09:37:50.4720854Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:37:50.4720977Z scf.for %arg3 = %0 to %22 step %c14592_i32 : i32 { 2026-02-21T09:37:50.4721112Z %23 = 
arith.divsi %arg3, %c8_i32 : i32 2026-02-21T09:37:50.4721229Z %24 = arith.muli %23, %c2_i32 : i32 2026-02-21T09:37:50.4721392Z %25 = arith.subi %c112_i32, %24 : i32 2026-02-21T09:37:50.4721506Z %26 = arith.minsi %25, %c2_i32 : i32 2026-02-21T09:37:50.4721624Z %27 = arith.remsi %arg3, %c8_i32 : i32 2026-02-21T09:37:50.4721739Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:37:50.4721867Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:37:50.4721979Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:37:50.4722088Z %31 = arith.muli %29, %c64_i32 : i32 2026-02-21T09:37:50.4722258Z %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4722469Z %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4722725Z %34 = arith.addi %32, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4722931Z %35 = arith.addi %33, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4723087Z %36 = arith.muli %30, %c16_i32 : i32 2026-02-21T09:37:50.4723255Z %37 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4723464Z %38 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4723668Z %39 = arith.addi %37, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4723874Z %40 = arith.addi %38, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4724134Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4724382Z %42 = arith.muli %41, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4724572Z %43 = tt.broadcast %42 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4724843Z %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:37:50.4725115Z %45 = tt.broadcast %44 : tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4725396Z %46 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:50.4725669Z %121 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4725896Z %122 = arith.addi %121, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4726068Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:50.4726236Z %124 = tt.splat %123 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4726452Z %125 = arith.addi %124, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4726726Z %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:37:50.4727003Z %127 = tt.broadcast %126 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4727198Z %128 = arith.addi %43, %127 : tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4727400Z %129 = tt.addptr %7, %128 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4727607Z %130 = tt.load %129 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:37:50.4727828Z %131 = ttg.local_alloc %130 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:50.4728159Z %132 = ttg.local_load %131 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4728564Z %133 = arith.extf %132 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4728985Z %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4729232Z %135 = arith.muli %134, %cst_3 : tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4729425Z %136 = tt.broadcast %135 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4729640Z %137 = arith.addi %136, %45 : tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4729835Z %138 = tt.addptr %8, %137 : tensor<4x64x!tt.ptr, #blocked1>, tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4730037Z %139 = tt.load %138 : tensor<4x64x!tt.ptr, #blocked1> 2026-02-21T09:37:50.4730276Z %140 = ttg.convert_layout %139 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4730556Z %141 = arith.shli %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4730796Z %142 = arith.shrsi %141, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4731029Z %143 = arith.shrsi %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4731318Z %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4731651Z %145 = tt.expand_dims %143 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4731933Z %146 = tt.broadcast %144 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4732172Z %147 = arith.select %13, %146, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4732408Z %148 = tt.broadcast %145 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4732641Z %149 = arith.select %15, %148, %147 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4732870Z %150 = tt.reshape %149 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3> 
2026-02-21T09:37:50.4733110Z %151 = arith.sitofp %150 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:37:50.4733359Z %152 = ttg.local_alloc %151 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:37:50.4733679Z %153 = ttg.local_load %152 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4734152Z %154 = tt.dot %133, %153, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4734504Z scf.yield %154 : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4734633Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:37:50.4734798Z %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:50.4735053Z %48 = tt.expand_dims %40 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4735286Z %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4735505Z %50 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:50.4735752Z %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4735945Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4736112Z %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4736293Z %54 = tt.addptr %16, %53 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4736478Z tt.store %54, %47 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.4736663Z %55 = arith.addi %arg3, %c4864_i32 : i32 2026-02-21T09:37:50.4736784Z %56 = arith.divsi %55, %c8_i32 : i32 2026-02-21T09:37:50.4736902Z %57 = arith.muli %56, %c2_i32 : i32 
2026-02-21T09:37:50.4737017Z %58 = arith.subi %c112_i32, %57 : i32 2026-02-21T09:37:50.4737152Z %59 = arith.minsi %58, %c2_i32 : i32 2026-02-21T09:37:50.4737265Z %60 = arith.remsi %55, %c8_i32 : i32 2026-02-21T09:37:50.4737375Z %61 = arith.remsi %60, %59 : i32 2026-02-21T09:37:50.4737487Z %62 = arith.addi %57, %61 : i32 2026-02-21T09:37:50.4737594Z %63 = arith.divsi %60, %59 : i32 2026-02-21T09:37:50.4737704Z %64 = arith.muli %62, %c64_i32 : i32 2026-02-21T09:37:50.4737867Z %65 = tt.splat %64 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4738077Z %66 = tt.splat %64 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4738290Z %67 = arith.addi %65, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4738495Z %68 = arith.addi %66, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4738657Z %69 = arith.muli %63, %c16_i32 : i32 2026-02-21T09:37:50.4738817Z %70 = tt.splat %69 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4739025Z %71 = tt.splat %69 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4739230Z %72 = arith.addi %70, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4739434Z %73 = arith.addi %71, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4739696Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4739940Z %75 = arith.muli %74, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4740129Z %76 = tt.broadcast %75 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4740400Z %77 = tt.expand_dims %67 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:37:50.4740689Z %78 = tt.broadcast %77 : 
tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4740954Z %79 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:50.4741218Z %121 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4741445Z %122 = arith.addi %121, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4741619Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:50.4741788Z %124 = tt.splat %123 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4742009Z %125 = arith.addi %124, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4742279Z %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:37:50.4749282Z %127 = tt.broadcast %126 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4749507Z %128 = arith.addi %76, %127 : tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4749714Z %129 = tt.addptr %7, %128 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4749922Z %130 = tt.load %129 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:37:50.4750143Z %131 = ttg.local_alloc %130 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:50.4750473Z %132 = ttg.local_load %131 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4750941Z %133 = arith.extf %132 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4751328Z %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4751608Z 
%135 = arith.muli %134, %cst_3 : tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4751799Z %136 = tt.broadcast %135 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4751992Z %137 = arith.addi %136, %78 : tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4752185Z %138 = tt.addptr %8, %137 : tensor<4x64x!tt.ptr, #blocked1>, tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4752386Z %139 = tt.load %138 : tensor<4x64x!tt.ptr, #blocked1> 2026-02-21T09:37:50.4752629Z %140 = ttg.convert_layout %139 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4752912Z %141 = arith.shli %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4753147Z %142 = arith.shrsi %141, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4753378Z %143 = arith.shrsi %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4753665Z %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4753999Z %145 = tt.expand_dims %143 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4754277Z %146 = tt.broadcast %144 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4754512Z %147 = arith.select %13, %146, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4754771Z %148 = tt.broadcast %145 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4755000Z %149 = arith.select %15, %148, %147 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4755258Z %150 = tt.reshape %149 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:37:50.4755480Z %151 = arith.sitofp %150 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, 
#blocked3> 2026-02-21T09:37:50.4755727Z %152 = ttg.local_alloc %151 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:37:50.4756049Z %153 = ttg.local_load %152 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4756520Z %154 = tt.dot %133, %153, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4756871Z scf.yield %154 : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4757001Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:37:50.4757165Z %80 = arith.truncf %79 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:50.4757418Z %81 = tt.expand_dims %73 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4757646Z %82 = arith.muli %81, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4757866Z %83 = tt.expand_dims %68 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:50.4758113Z %84 = tt.broadcast %82 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4758305Z %85 = tt.broadcast %83 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4758473Z %86 = arith.addi %84, %85 : tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4758685Z %87 = tt.addptr %16, %86 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4758870Z tt.store %87, %80 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.4759013Z %88 = arith.addi %arg3, %c9728_i32 : i32 2026-02-21T09:37:50.4759133Z %89 = arith.divsi %88, %c8_i32 : i32 2026-02-21T09:37:50.4759265Z %90 = arith.muli %89, %c2_i32 : i32 2026-02-21T09:37:50.4759381Z %91 = arith.subi %c112_i32, %90 : i32 2026-02-21T09:37:50.4759497Z %92 = arith.minsi %91, 
%c2_i32 : i32 2026-02-21T09:37:50.4759610Z %93 = arith.remsi %88, %c8_i32 : i32 2026-02-21T09:37:50.4759721Z %94 = arith.remsi %93, %92 : i32 2026-02-21T09:37:50.4759833Z %95 = arith.addi %90, %94 : i32 2026-02-21T09:37:50.4759938Z %96 = arith.divsi %93, %92 : i32 2026-02-21T09:37:50.4760048Z %97 = arith.muli %95, %c64_i32 : i32 2026-02-21T09:37:50.4760209Z %98 = tt.splat %97 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4760421Z %99 = tt.splat %97 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4760632Z %100 = arith.addi %98, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4760840Z %101 = arith.addi %99, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4761004Z %102 = arith.muli %96, %c16_i32 : i32 2026-02-21T09:37:50.4761172Z %103 = tt.splat %102 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4761386Z %104 = tt.splat %102 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4761605Z %105 = arith.addi %103, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4761814Z %106 = arith.addi %104, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4762088Z %107 = tt.expand_dims %105 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4762339Z %108 = arith.muli %107, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4762552Z %109 = tt.broadcast %108 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4762862Z %110 = tt.expand_dims %100 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:37:50.4763138Z %111 = tt.broadcast %110 : tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4763406Z %112 = scf.for 
%arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:50.4763671Z %121 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4763895Z %122 = arith.addi %121, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4764069Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:50.4764241Z %124 = tt.splat %123 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4764461Z %125 = arith.addi %124, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4764731Z %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:37:50.4765006Z %127 = tt.broadcast %126 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4765197Z %128 = arith.addi %109, %127 : tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4765396Z %129 = tt.addptr %7, %128 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4765603Z %130 = tt.load %129 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:37:50.4765819Z %131 = ttg.local_alloc %130 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:50.4766186Z %132 = ttg.local_load %131 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4766593Z %133 = arith.extf %132 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4766987Z %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4767234Z %135 = arith.muli %134, %cst_3 : tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4767424Z %136 = 
tt.broadcast %135 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4767618Z %137 = arith.addi %136, %111 : tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4767812Z %138 = tt.addptr %8, %137 : tensor<4x64x!tt.ptr, #blocked1>, tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4768011Z %139 = tt.load %138 : tensor<4x64x!tt.ptr, #blocked1> 2026-02-21T09:37:50.4768254Z %140 = ttg.convert_layout %139 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4768531Z %141 = arith.shli %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4768766Z %142 = arith.shrsi %141, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4768999Z %143 = arith.shrsi %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4769284Z %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4769616Z %145 = tt.expand_dims %143 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4769901Z %146 = tt.broadcast %144 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4770138Z %147 = arith.select %13, %146, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4770400Z %148 = tt.broadcast %145 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4770626Z %149 = arith.select %15, %148, %147 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4770851Z %150 = tt.reshape %149 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:37:50.4771069Z %151 = arith.sitofp %150 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:37:50.4771319Z %152 = ttg.local_alloc %151 : (tensor<8x64xf32, 
#blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:37:50.4771644Z %153 = ttg.local_load %152 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4772111Z %154 = tt.dot %133, %153, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4772462Z scf.yield %154 : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4772591Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:37:50.4772755Z %113 = arith.truncf %112 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:50.4773017Z %114 = tt.expand_dims %106 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4773249Z %115 = arith.muli %114, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4773476Z %116 = tt.expand_dims %101 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:50.4773726Z %117 = tt.broadcast %115 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4773957Z %118 = tt.broadcast %116 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4774135Z %119 = arith.addi %117, %118 : tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4774320Z %120 = tt.addptr %16, %119 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4774526Z tt.store %120, %113 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.4774681Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:37:50.4774838Z scf.for %arg3 = %22 to %c448_i32 step %c4864_i32 : i32 { 2026-02-21T09:37:50.4774976Z %23 = arith.divsi %arg3, %c8_i32 : i32 2026-02-21T09:37:50.4775095Z %24 = arith.muli %23, %c2_i32 : i32 2026-02-21T09:37:50.4775211Z %25 = arith.subi %c112_i32, %24 : i32 
2026-02-21T09:37:50.4775326Z %26 = arith.minsi %25, %c2_i32 : i32 2026-02-21T09:37:50.4775446Z %27 = arith.remsi %arg3, %c8_i32 : i32 2026-02-21T09:37:50.4775558Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:37:50.4775672Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:37:50.4775778Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:37:50.4775891Z %31 = arith.muli %29, %c64_i32 : i32 2026-02-21T09:37:50.4776052Z %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4776261Z %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4776469Z %34 = arith.addi %32, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:37:50.4776672Z %35 = arith.addi %33, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:37:50.4776830Z %36 = arith.muli %30, %c16_i32 : i32 2026-02-21T09:37:50.4776989Z %37 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4777196Z %38 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4777402Z %39 = arith.addi %37, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:37:50.4777603Z %40 = arith.addi %38, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:37:50.4777885Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4778128Z %42 = arith.muli %41, %cst_4 : tensor<16x1xi32, #blocked2> 2026-02-21T09:37:50.4778316Z %43 = tt.broadcast %42 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4778588Z %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:37:50.4778855Z %45 = tt.broadcast %44 : tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 
2026-02-21T09:37:50.4779116Z %46 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>) : i32 { 2026-02-21T09:37:50.4779382Z %55 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4779602Z %56 = arith.addi %55, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:37:50.4779772Z %57 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:37:50.4779933Z %58 = tt.splat %57 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4780145Z %59 = arith.addi %58, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:37:50.4780410Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:37:50.4780678Z %61 = tt.broadcast %60 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4780865Z %62 = arith.addi %43, %61 : tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4781089Z %63 = tt.addptr %7, %62 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:37:50.4781288Z %64 = tt.load %63 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:37:50.4781503Z %65 = ttg.local_alloc %64 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:37:50.4781842Z %66 = ttg.local_load %65 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4782239Z %67 = arith.extf %66 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4782613Z %68 = tt.expand_dims %56 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4782856Z %69 = arith.muli %68, %cst_3 : tensor<4x1xi32, #blocked1> 2026-02-21T09:37:50.4783040Z 
%70 = tt.broadcast %69 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4783228Z %71 = arith.addi %70, %45 : tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4783421Z %72 = tt.addptr %8, %71 : tensor<4x64x!tt.ptr, #blocked1>, tensor<4x64xi32, #blocked1> 2026-02-21T09:37:50.4783612Z %73 = tt.load %72 : tensor<4x64x!tt.ptr, #blocked1> 2026-02-21T09:37:50.4783848Z %74 = ttg.convert_layout %73 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4784119Z %75 = arith.shli %74, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4784349Z %76 = arith.shrsi %75, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4784577Z %77 = arith.shrsi %74, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:37:50.4784851Z %78 = tt.expand_dims %76 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4785179Z %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:37:50.4785467Z %80 = tt.broadcast %78 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4785699Z %81 = arith.select %13, %80, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4785927Z %82 = tt.broadcast %79 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4786146Z %83 = arith.select %15, %82, %81 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:37:50.4786364Z %84 = tt.reshape %83 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:37:50.4786575Z %85 = arith.sitofp %84 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:37:50.4786820Z %86 = ttg.local_alloc %85 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, 
#shared1, #smem> 2026-02-21T09:37:50.4787135Z %87 = ttg.local_load %86 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:37:50.4787596Z %88 = tt.dot %67, %87, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4787937Z scf.yield %88 : tensor<16x64xf32, #mma> 2026-02-21T09:37:50.4788066Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:37:50.4788226Z %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma> 2026-02-21T09:37:50.4788480Z %48 = tt.expand_dims %40 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4788707Z %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:37:50.4788960Z %50 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:37:50.4789205Z %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4789399Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4789581Z %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4789755Z %54 = tt.addptr %16, %53 : tensor<16x64x!tt.ptr, #mma>, tensor<16x64xi32, #mma> 2026-02-21T09:37:50.4789938Z tt.store %54, %47 : tensor<16x64x!tt.ptr, #mma> 2026-02-21T09:37:50.4790092Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:37:50.4790225Z tt.return 2026-02-21T09:37:50.4790301Z } 2026-02-21T09:37:50.4790376Z } 2026-02-21T09:37:50.4790419Z 2026-02-21T09:37:50.4790451Z {-# 2026-02-21T09:37:50.4790529Z external_resources: { 2026-02-21T09:37:50.4790628Z mlir_reproducer: { 2026-02-21T09:37:50.4791620Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 
target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:37:50.4792612Z disable_threading: false, 2026-02-21T09:37:50.4792718Z verify_each: true 2026-02-21T09:37:50.4792804Z } 2026-02-21T09:37:50.4792878Z } 2026-02-21T09:37:50.4792945Z #-} 2026-02-21T09:37:50.4793216Z /tmp/torchinductor_root/ht/chtj3uegjhd2go2a3wm2vl7tyhopnj23oo6r55fq36bsvovs2hug.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:37:50.4793888Z /tmp/torchinductor_root/ht/chtj3uegjhd2go2a3wm2vl7tyhopnj23oo6r55fq36bsvovs2hug.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:37:50.4794452Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:37:50.4795221Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:37:50.4795927Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:37:50.4796093Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:37:51.4110865Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 16.0 configs/s 2026-02-21T09:37:55.3789541Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 230.1 2026-02-21T09:37:55.3790198Z configs/s 2026-02-21T09:37:55.9644558Z [112s] Generation 2 complete: 2026-02-21T09:37:55.9644945Z error=15 2026-02-21T09:37:55.9645151Z ok=87 2026-02-21T09:37:55.9645359Z min=0.1117 2026-02-21T09:37:55.9645568Z mid=0.2379 2026-02-21T09:37:55.9645762Z max=8.9027 2026-02-21T09:37:55.9645988Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:37:55.9646355Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:37:55.9646705Z 'l2_groupings': [1], 2026-02-21T09:37:55.9646981Z 'load_eviction_policies': ['', ''], 2026-02-21T09:37:55.9647290Z 'loop_orders': [[0, 1]], 2026-02-21T09:37:55.9647996Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:37:55.9648266Z 'num_stages': 2, 2026-02-21T09:37:55.9648493Z 'num_warps': 4, 2026-02-21T09:37:55.9648741Z 'pid_type': 'flat', 2026-02-21T09:37:55.9649003Z 'range_flattens': [None, None], 2026-02-21T09:37:55.9649304Z 'range_multi_buffers': [None, None], 2026-02-21T09:37:55.9649720Z 'range_num_stages': [0, 0], 2026-02-21T09:37:55.9649996Z 'range_unroll_factors': [0, 1], 2026-02-21T09:37:55.9650290Z 'range_warp_specializes': [], 
2026-02-21T09:37:55.9650568Z 'waves_per_eu': 2} 2026-02-21T09:37:55.9984414Z [112s] Fitting surrogate: 308 points, 308 targets 2026-02-21T09:37:56.9687166Z [113s] Generation 3 starting: 98 neighbors, 5 active search path(s) 2026-02-21T09:38:13.1621123Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 7.5 configs/s 2026-02-21T09:38:18.4554075Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:38:18.4555601Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:38:18.4558104Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:38:18.4559068Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:38:18.4559866Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:38:18.4560581Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:38:18.4561216Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:38:18.4561676Z #smem = #ttg.shared_memory 2026-02-21T09:38:18.4562035Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:38:18.4562836Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:38:18.4563713Z %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32, #mma> 2026-02-21T09:38:18.4563957Z %c16_i32 = arith.constant 16 : i32 
2026-02-21T09:38:18.4564137Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:38:18.4564326Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:38:18.4564499Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:38:18.4564722Z %cst_0 = arith.constant dense<0> : tensor<4x2x32xi8, #blocked> 2026-02-21T09:38:18.4564940Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:38:18.4565114Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:38:18.4565294Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:38:18.4565511Z %cst_1 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1> 2026-02-21T09:38:18.4565849Z %cst_2 = arith.constant dense<7168> : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4566235Z %cst_3 = arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4566564Z %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:38:18.4566821Z %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:38:18.4567078Z %cst_6 = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:38:18.4567298Z %0 = tt.get_program_id x : i32 2026-02-21T09:38:18.4567470Z %1 = arith.remsi %0, %c224_i32 : i32 2026-02-21T09:38:18.4567653Z %2 = arith.divsi %0, %c224_i32 : i32 2026-02-21T09:38:18.4567821Z %3 = arith.muli %1, %c32_i32 : i32 2026-02-21T09:38:18.4568187Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:38:18.4568872Z %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:38:18.4569297Z %6 = tt.splat %3 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:38:18.4569743Z %7 = tt.splat %3 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:38:18.4570111Z %8 = arith.addi %6, %4 : tensor<32xi32, #ttg.slice<{dim = 0, 
parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:38:18.4570480Z %9 = arith.addi %7, %5 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:38:18.4570724Z %10 = arith.muli %2, %c16_i32 : i32 2026-02-21T09:38:18.4571026Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:38:18.4571427Z %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:38:18.4571717Z %13 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:38:18.4572022Z %14 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:38:18.4572266Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:38:18.4572513Z %16 = arith.addi %14, %12 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:38:18.4572835Z %17 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:38:18.4573194Z %18 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:38:18.4573551Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1> 2026-02-21T09:38:18.4573845Z %20 = arith.muli %19, %cst_1 : tensor<16x1xi32, #blocked1> 2026-02-21T09:38:18.4574070Z %21 = tt.broadcast %20 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:38:18.4574350Z %22 = tt.splat %arg0 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:38:18.4574741Z %23 = tt.expand_dims %8 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4575237Z %24 = tt.broadcast 
%23 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4575614Z %25 = tt.splat %arg1 : !tt.ptr -> tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4575972Z %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:38:18.4576453Z %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:38:18.4576916Z %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:38:18.4577219Z %29 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:38:18.4577449Z %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:38:18.4577680Z %31 = arith.cmpi eq, %28, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:38:18.4577904Z %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:38:18.4578205Z %33 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg4 = %cst) -> (tensor<16x32xf32, #mma>) : i32 { 2026-02-21T09:38:18.4578661Z %43 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:38:18.4579017Z %44 = arith.addi %43, %17 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:38:18.4579269Z %45 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:38:18.4579490Z %46 = tt.splat %45 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:38:18.4579742Z %47 = arith.addi %46, %18 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:38:18.4580060Z %48 = 
tt.expand_dims %47 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:38:18.4580379Z %49 = tt.broadcast %48 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T09:38:18.4580603Z %50 = arith.addi %21, %49 : tensor<16x8xi32, #blocked1> 2026-02-21T09:38:18.4580833Z %51 = tt.addptr %22, %50 : tensor<16x8x!tt.ptr, #blocked1>, tensor<16x8xi32, #blocked1> 2026-02-21T09:38:18.4581076Z %52 = tt.load %51 : tensor<16x8x!tt.ptr, #blocked1> 2026-02-21T09:38:18.4581325Z %53 = ttg.local_alloc %52 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:38:18.4581676Z %54 = ttg.local_load %53 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:38:18.4582077Z %55 = arith.extf %54 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:38:18.4582532Z %56 = tt.expand_dims %44 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4582886Z %57 = arith.muli %56, %cst_2 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4583191Z %58 = tt.broadcast %57 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4583492Z %59 = arith.addi %58, %24 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4583826Z %60 = tt.addptr %25, %59 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4584133Z %61 = tt.load %60 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4584360Z %62 = arith.shli %61, %cst_3 : 
tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4584594Z %63 = arith.shrsi %62, %cst_3 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4584828Z %64 = arith.shrsi %61, %cst_3 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:38:18.4585116Z %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:38:18.4585444Z %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:38:18.4585725Z %67 = tt.broadcast %65 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:38:18.4585957Z %68 = arith.select %30, %67, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:38:18.4586190Z %69 = tt.broadcast %66 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:38:18.4586420Z %70 = arith.select %32, %69, %68 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:38:18.4586639Z %71 = tt.reshape %70 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:38:18.4586858Z %72 = arith.sitofp %71 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:38:18.4587129Z %73 = ttg.local_alloc %72 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:38:18.4587448Z %74 = ttg.local_load %73 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:38:18.4587918Z %75 = tt.dot %55, %74, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:38:18.4588281Z scf.yield %75 : tensor<16x32xf32, #mma> 2026-02-21T09:38:18.4588415Z } {tt.flatten, tt.num_stages = 3 : 
i32} 2026-02-21T09:38:18.4588578Z %34 = arith.truncf %33 : tensor<16x32xf32, #mma> to tensor<16x32xbf16, #mma> 2026-02-21T09:38:18.4588837Z %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:38:18.4589073Z %36 = arith.muli %35, %cst_6 : tensor<16x1xi32, #mma> 2026-02-21T09:38:18.4589297Z %37 = tt.expand_dims %9 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T09:38:18.4589547Z %38 = tt.broadcast %36 : tensor<16x1xi32, #mma> -> tensor<16x32xi32, #mma> 2026-02-21T09:38:18.4589740Z %39 = tt.broadcast %37 : tensor<1x32xi32, #mma> -> tensor<16x32xi32, #mma> 2026-02-21T09:38:18.4589916Z %40 = arith.addi %38, %39 : tensor<16x32xi32, #mma> 2026-02-21T09:38:18.4590087Z %41 = tt.splat %arg2 : !tt.ptr -> tensor<16x32x!tt.ptr, #mma> 2026-02-21T09:38:18.4590294Z %42 = tt.addptr %41, %40 : tensor<16x32x!tt.ptr, #mma>, tensor<16x32xi32, #mma> 2026-02-21T09:38:18.4590490Z tt.store %42, %34 : tensor<16x32x!tt.ptr, #mma> 2026-02-21T09:38:18.4590619Z tt.return 2026-02-21T09:38:18.4590707Z } 2026-02-21T09:38:18.4590786Z } 2026-02-21T09:38:18.4590835Z 2026-02-21T09:38:18.4590868Z {-# 2026-02-21T09:38:18.4590953Z external_resources: { 2026-02-21T09:38:18.4591058Z mlir_reproducer: { 2026-02-21T09:38:18.4592054Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:38:18.4593073Z 
disable_threading: false, 2026-02-21T09:38:18.4593182Z verify_each: true 2026-02-21T09:38:18.4593277Z } 2026-02-21T09:38:18.4593354Z } 2026-02-21T09:38:18.4593433Z #-} 2026-02-21T09:38:18.4593713Z /tmp/torchinductor_root/da/cdattooczxmlgvhivkfyxauwjzd73sq74blzs7vbmlnrihpx52hr.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:38:18.4594401Z /tmp/torchinductor_root/da/cdattooczxmlgvhivkfyxauwjzd73sq74blzs7vbmlnrihpx52hr.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:38:18.4594960Z [134s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:38:18.4595670Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:38:18.4596326Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:38:18.4596531Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:38:19.5143755Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 15.8 configs/s 2026-02-21T09:38:26.2784685Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 155.3 2026-02-21T09:38:26.2785707Z configs/s 2026-02-21T09:38:26.8504467Z [142s] Generation 3 complete: 2026-02-21T09:38:26.8504825Z error=2 2026-02-21T09:38:26.8505034Z ok=101 2026-02-21T09:38:26.8505243Z min=0.1116 2026-02-21T09:38:26.8505453Z mid=0.2058 2026-02-21T09:38:26.8505661Z max=12.0261 2026-02-21T09:38:26.8505903Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:38:26.8506272Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:38:26.8506630Z 'l2_groupings': [1], 2026-02-21T09:38:26.8506900Z 'load_eviction_policies': ['', ''], 2026-02-21T09:38:26.8507219Z 'loop_orders': [[0, 1]], 2026-02-21T09:38:26.8507503Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:38:26.8507807Z 'num_stages': 2, 2026-02-21T09:38:26.8508029Z 'num_warps': 4, 2026-02-21T09:38:26.8508273Z 'pid_type': 'flat', 2026-02-21T09:38:26.8508549Z 'range_flattens': [None, True], 2026-02-21T09:38:26.8508861Z 'range_multi_buffers': [None, None], 2026-02-21T09:38:26.8509177Z 'range_num_stages': [0, 0], 2026-02-21T09:38:26.8509453Z 'range_unroll_factors': [0, 1], 2026-02-21T09:38:26.8509751Z 'range_warp_specializes': [], 2026-02-21T09:38:26.8510022Z 'waves_per_eu': 2} 2026-02-21T09:38:26.9175240Z [143s] Fitting surrogate: 411 points, 411 targets 2026-02-21T09:38:27.8925056Z [143s] Generation 4 starting: 93 neighbors, 5 active search path(s) 2026-02-21T09:38:45.0472271Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 3.1 configs/s 2026-02-21T09:38:50.9074607Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 16.4 configs/s 2026-02-21T09:38:56.6565832Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 164.2 2026-02-21T09:38:56.6566186Z configs/s 2026-02-21T09:38:57.2563058Z [173s] Generation 4 complete: 2026-02-21T09:38:57.2564014Z error=7 
2026-02-21T09:38:57.2564254Z ok=91 2026-02-21T09:38:57.2564452Z min=0.1092 2026-02-21T09:38:57.2564682Z mid=0.2270 2026-02-21T09:38:57.2564886Z max=8.0735 2026-02-21T09:38:57.2565198Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:38:57.2565574Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:38:57.2565942Z 'l2_groupings': [1], 2026-02-21T09:38:57.2566199Z 'load_eviction_policies': ['', ''], 2026-02-21T09:38:57.2566473Z 'loop_orders': [[0, 1]], 2026-02-21T09:38:57.2566722Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:38:57.2566961Z 'num_stages': 1, 2026-02-21T09:38:57.2567429Z 'num_warps': 4, 2026-02-21T09:38:57.2567726Z 'pid_type': 'flat', 2026-02-21T09:38:57.2567982Z 'range_flattens': [None, True], 2026-02-21T09:38:57.2568266Z 'range_multi_buffers': [None, None], 2026-02-21T09:38:57.2568617Z 'range_num_stages': [0, 0], 2026-02-21T09:38:57.2568873Z 'range_unroll_factors': [0, 1], 2026-02-21T09:38:57.2569135Z 'range_warp_specializes': [], 2026-02-21T09:38:57.2569391Z 'waves_per_eu': 3} 2026-02-21T09:38:57.3202466Z [173s] Fitting surrogate: 509 points, 509 targets 2026-02-21T09:38:58.2904321Z [174s] Generation 5 starting: 91 neighbors, 5 active search path(s) 2026-02-21T09:39:17.7830277Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 2.2 configs/s 2026-02-21T09:39:23.3461630Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 17.0 configs/s 2026-02-21T09:39:31.4165172Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 119.2 2026-02-21T09:39:31.4165790Z configs/s 2026-02-21T09:39:32.1046830Z [208s] Generation 5 complete: 2026-02-21T09:39:32.1047212Z error=7 2026-02-21T09:39:32.1047406Z ok=89 2026-02-21T09:39:32.1047604Z min=0.1086 2026-02-21T09:39:32.1047800Z mid=0.1562 2026-02-21T09:39:32.1048433Z max=5.9452 2026-02-21T09:39:32.1048653Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:39:32.1049025Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:39:32.1049367Z 'l2_groupings': [1], 
2026-02-21T09:39:32.1049632Z 'load_eviction_policies': ['', ''], 2026-02-21T09:39:32.1050045Z 'loop_orders': [[0, 1]], 2026-02-21T09:39:32.1050309Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:39:32.1050568Z 'num_stages': 1, 2026-02-21T09:39:32.1050784Z 'num_warps': 4, 2026-02-21T09:39:32.1051009Z 'pid_type': 'flat', 2026-02-21T09:39:32.1051255Z 'range_flattens': [None, True], 2026-02-21T09:39:32.1051545Z 'range_multi_buffers': [None, None], 2026-02-21T09:39:32.1051836Z 'range_num_stages': [0, 0], 2026-02-21T09:39:32.1052105Z 'range_unroll_factors': [0, 1], 2026-02-21T09:39:32.1052398Z 'range_warp_specializes': [], 2026-02-21T09:39:32.1052671Z 'waves_per_eu': 3} 2026-02-21T09:39:32.1978945Z [208s] Fitting surrogate: 605 points, 605 targets 2026-02-21T09:39:33.0933917Z [209s] Generation 6 starting: 85 neighbors, 5 active search path(s) 2026-02-21T09:39:52.5166883Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.9 configs/s 2026-02-21T09:39:56.6908658Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:39:56.6917439Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:39:56.6918347Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:39:56.6919188Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:39:56.6919963Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:39:56.6920872Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:39:56.6921376Z #smem = #ttg.shared_memory 2026-02-21T09:39:56.6921969Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:39:56.6923293Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:39:56.6924334Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:39:56.6924761Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:39:56.6925262Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:39:56.6925726Z %cst_2 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6926746Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2> 2026-02-21T09:39:56.6927122Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:39:56.6927602Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<16x128xf32, #mma> 2026-02-21T09:39:56.6927968Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:39:56.6928358Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:39:56.6928673Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:39:56.6929066Z 
%cst_5 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6929509Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:39:56.6929842Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:39:56.6930169Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:39:56.6930526Z %c4092_i32 = arith.constant 4092 : i32 2026-02-21T09:39:56.6930857Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:39:56.6931405Z %cst_6 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6931942Z %0 = tt.get_program_id x : i32 2026-02-21T09:39:56.6932211Z %1 = arith.divsi %0, %c224_i32 : i32 2026-02-21T09:39:56.6932452Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:39:56.6932696Z %3 = arith.subi %c4_i32, %2 : i32 2026-02-21T09:39:56.6932963Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:39:56.6933349Z %5 = arith.remsi %0, %c224_i32 : i32 2026-02-21T09:39:56.6933613Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:39:56.6933860Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:39:56.6934098Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:39:56.6934282Z %9 = arith.muli %7, %c16_i32 : i32 2026-02-21T09:39:56.6934578Z %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:39:56.6934984Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:39:56.6935400Z %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:39:56.6935717Z %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:39:56.6936026Z %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:39:56.6936410Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:39:56.6936652Z %16 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:39:56.6936933Z %17 = tt.make_range {end = 128 : i32, start 
= 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:39:56.6937342Z %18 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:39:56.6937709Z %19 = tt.splat %16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:39:56.6938057Z %20 = tt.splat %16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:39:56.6938376Z %21 = arith.addi %19, %17 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:39:56.6938680Z %22 = arith.addi %20, %18 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:39:56.6939044Z %23 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6939440Z %24 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6939887Z %25 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:39:56.6940254Z %26 = arith.muli %25, %cst_3 : tensor<16x1xi32, #blocked2> 2026-02-21T09:39:56.6940534Z %27 = tt.broadcast %26 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6940849Z %28 = tt.splat %arg0 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:39:56.6941294Z %29 = tt.expand_dims %22 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:39:56.6941710Z %30 = tt.broadcast %29 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6942026Z %31 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:39:56.6942400Z %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:39:56.6942874Z %33 = 
tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:39:56.6943326Z %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:39:56.6943614Z %35 = arith.cmpi eq, %34, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:39:56.6943841Z %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:39:56.6944059Z %37 = arith.cmpi eq, %34, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:39:56.6944276Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:39:56.6944575Z %39 = scf.for %arg3 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg4 = %cst_4) -> (tensor<16x128xf32, #mma>) : i32 { 2026-02-21T09:39:56.6944901Z %50 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6945157Z %51 = arith.addi %50, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6945352Z %52 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:39:56.6945539Z %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6945787Z %54 = arith.addi %53, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6946105Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:39:56.6946436Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6946651Z %57 = arith.addi %27, %56 : tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6946877Z %58 = tt.addptr %28, %57 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6947110Z %59 = tt.load %58 : tensor<16x4x!tt.ptr, #blocked2> 
2026-02-21T09:39:56.6947411Z %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6947870Z %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6948300Z %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6948577Z %63 = arith.muli %62, %cst_2 : tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6948794Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6949016Z %65 = arith.addi %64, %30 : tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6949237Z %66 = tt.addptr %31, %65 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6949461Z %67 = tt.load %66 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:39:56.6949737Z %68 = ttg.convert_layout %67 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6950053Z %69 = arith.shli %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6950318Z %70 = arith.shrsi %69, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6950619Z %71 = arith.shrsi %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6950940Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6951321Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6951641Z %74 = tt.broadcast %72 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6951916Z %75 = 
arith.select %36, %74, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6952182Z %76 = tt.broadcast %73 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6952427Z %77 = arith.select %38, %76, %75 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6952652Z %78 = tt.reshape %77 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1> 2026-02-21T09:39:56.6952866Z %79 = arith.sitofp %78 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1> 2026-02-21T09:39:56.6953113Z %80 = ttg.local_alloc %79 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem> 2026-02-21T09:39:56.6953447Z %81 = ttg.local_load %80 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6953913Z %82 = tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:39:56.6954256Z %83 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:39:56.6954421Z %84 = tt.splat %83 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6954644Z %85 = arith.addi %84, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6954814Z %86 = arith.muli %83, %c2_i32 : i32 2026-02-21T09:39:56.6954973Z %87 = tt.splat %86 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6955204Z %88 = arith.addi %87, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6955470Z %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:39:56.6955738Z %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6955923Z 
%91 = arith.addi %27, %90 : tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6956114Z %92 = tt.addptr %28, %91 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6956313Z %93 = tt.load %92 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:39:56.6956572Z %94 = ttg.convert_layout %93 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6956963Z %95 = arith.extf %94 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6957342Z %96 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6957581Z %97 = arith.muli %96, %cst_2 : tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6957770Z %98 = tt.broadcast %97 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6957956Z %99 = arith.addi %98, %30 : tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6958177Z %100 = tt.addptr %31, %99 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6958390Z %101 = tt.load %100 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:39:56.6958669Z %102 = ttg.convert_layout %101 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6958957Z %103 = arith.shli %102, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6959197Z %104 = arith.shrsi %103, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6959439Z %105 = arith.shrsi %102, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6959735Z %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6960075Z %107 = tt.expand_dims 
%105 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6960364Z %108 = tt.broadcast %106 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6960606Z %109 = arith.select %36, %108, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6969505Z %110 = tt.broadcast %107 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6969781Z %111 = arith.select %38, %110, %109 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6970064Z %112 = tt.reshape %111 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1> 2026-02-21T09:39:56.6970294Z %113 = arith.sitofp %112 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1> 2026-02-21T09:39:56.6970548Z %114 = ttg.local_alloc %113 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem> 2026-02-21T09:39:56.6970873Z %115 = ttg.local_load %114 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6971352Z %116 = tt.dot %95, %115, %82, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:39:56.6971695Z %117 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:39:56.6971890Z %118 = tt.splat %117 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6972116Z %119 = arith.addi %118, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6972288Z %120 = arith.muli %117, %c2_i32 : i32 2026-02-21T09:39:56.6972457Z %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6972673Z %122 = arith.addi %121, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T09:39:56.6972950Z %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:39:56.6973228Z %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6973424Z %125 = arith.addi %27, %124 : tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6973627Z %126 = tt.addptr %28, %125 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6973833Z %127 = tt.load %126 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:39:56.6974099Z %128 = ttg.convert_layout %127 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6974494Z %129 = arith.extf %128 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6974871Z %130 = tt.expand_dims %119 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6975118Z %131 = arith.muli %130, %cst_2 : tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6975344Z %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6975539Z %133 = arith.addi %132, %30 : tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6975743Z %134 = tt.addptr %31, %133 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6975945Z %135 = tt.load %134 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:39:56.6976189Z %136 = ttg.convert_layout %135 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6976467Z %137 = arith.shli %136, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6976704Z %138 = arith.shrsi %137, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:39:56.6976939Z %139 = arith.shrsi %136, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6977228Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6977567Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6977856Z %142 = tt.broadcast %140 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6978109Z %143 = arith.select %36, %142, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6978348Z %144 = tt.broadcast %141 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6978580Z %145 = arith.select %38, %144, %143 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6978814Z %146 = tt.reshape %145 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1> 2026-02-21T09:39:56.6979036Z %147 = arith.sitofp %146 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1> 2026-02-21T09:39:56.6979293Z %148 = ttg.local_alloc %147 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem> 2026-02-21T09:39:56.6979636Z %149 = ttg.local_load %148 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6980104Z %150 = tt.dot %129, %149, %116, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:39:56.6980457Z scf.yield %150 : tensor<16x128xf32, #mma> 2026-02-21T09:39:56.6980591Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:39:56.6980806Z %40 = scf.for %arg3 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg4 = %39) -> 
(tensor<16x128xf32, #mma>) : i32 { 2026-02-21T09:39:56.6981076Z %50 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6981298Z %51 = arith.addi %50, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:39:56.6981470Z %52 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:39:56.6981634Z %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6981845Z %54 = arith.addi %53, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:39:56.6982121Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:39:56.6982391Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6982584Z %57 = arith.addi %27, %56 : tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6982777Z %58 = tt.addptr %28, %57 : tensor<16x4x!tt.ptr, #blocked2>, tensor<16x4xi32, #blocked2> 2026-02-21T09:39:56.6982978Z %59 = tt.load %58 : tensor<16x4x!tt.ptr, #blocked2> 2026-02-21T09:39:56.6983270Z %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6983671Z %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6984043Z %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6984283Z %63 = arith.muli %62, %cst_2 : tensor<2x1xi32, #blocked1> 2026-02-21T09:39:56.6984470Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6984661Z %65 = arith.addi %64, %30 : tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6984852Z %66 = tt.addptr %31, %65 : tensor<2x128x!tt.ptr, 
#blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:39:56.6985051Z %67 = tt.load %66 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:39:56.6985294Z %68 = ttg.convert_layout %67 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6985566Z %69 = arith.shli %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6985816Z %70 = arith.shrsi %69, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6986043Z %71 = arith.shrsi %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:39:56.6986329Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6986663Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:39:56.6986944Z %74 = tt.broadcast %72 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6987183Z %75 = arith.select %36, %74, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6987416Z %76 = tt.broadcast %73 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6987661Z %77 = arith.select %38, %76, %75 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:39:56.6987886Z %78 = tt.reshape %77 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1> 2026-02-21T09:39:56.6988102Z %79 = arith.sitofp %78 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1> 2026-02-21T09:39:56.6988350Z %80 = ttg.local_alloc %79 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem> 2026-02-21T09:39:56.6988666Z %81 = ttg.local_load %80 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:39:56.6989135Z %82 
= tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma> 2026-02-21T09:39:56.6989478Z scf.yield %82 : tensor<16x128xf32, #mma> 2026-02-21T09:39:56.6989606Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:39:56.6989774Z %41 = arith.truncf %40 : tensor<16x128xf32, #mma> to tensor<16x128xbf16, #mma> 2026-02-21T09:39:56.6990030Z %42 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:39:56.6990260Z %43 = arith.muli %42, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:39:56.6990488Z %44 = tt.expand_dims %21 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:39:56.6990741Z %45 = tt.broadcast %43 : tensor<16x1xi32, #mma> -> tensor<16x128xi32, #mma> 2026-02-21T09:39:56.6990938Z %46 = tt.broadcast %44 : tensor<1x128xi32, #mma> -> tensor<16x128xi32, #mma> 2026-02-21T09:39:56.6991138Z %47 = arith.addi %45, %46 : tensor<16x128xi32, #mma> 2026-02-21T09:39:56.6991308Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<16x128x!tt.ptr, #mma> 2026-02-21T09:39:56.6991520Z %49 = tt.addptr %48, %47 : tensor<16x128x!tt.ptr, #mma>, tensor<16x128xi32, #mma> 2026-02-21T09:39:56.6991707Z tt.store %49, %41 : tensor<16x128x!tt.ptr, #mma> 2026-02-21T09:39:56.6991839Z tt.return 2026-02-21T09:39:56.6991919Z } 2026-02-21T09:39:56.6991996Z } 2026-02-21T09:39:56.6992042Z 2026-02-21T09:39:56.6992073Z {-# 2026-02-21T09:39:56.6992158Z external_resources: { 2026-02-21T09:39:56.6992256Z mlir_reproducer: { 2026-02-21T09:39:56.6993256Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:39:56.6994275Z disable_threading: false, 2026-02-21T09:39:56.6994383Z verify_each: true 2026-02-21T09:39:56.6994471Z } 2026-02-21T09:39:56.6994546Z } 2026-02-21T09:39:56.6994614Z #-} 2026-02-21T09:39:56.6994892Z /tmp/torchinductor_root/iu/ciu33d2kqmchnrze6gaykjz6ozjy4nn6ut2tq2xu35tct73cvs4a.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:39:56.6995575Z /tmp/torchinductor_root/iu/ciu33d2kqmchnrze6gaykjz6ozjy4nn6ut2tq2xu35tct73cvs4a.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:39:56.6996120Z [232s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:39:56.6996853Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:39:56.6997503Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:39:56.6997670Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:39:58.0283015Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 16.2 configs/s 2026-02-21T09:40:02.8317985Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 195.9 2026-02-21T09:40:02.8318645Z configs/s 2026-02-21T09:40:03.4209266Z [239s] Generation 6 complete: 2026-02-21T09:40:03.4209637Z error=5 2026-02-21T09:40:03.4209833Z ok=85 2026-02-21T09:40:03.4210020Z min=0.1021 2026-02-21T09:40:03.4210217Z mid=0.2053 2026-02-21T09:40:03.4210418Z max=12.5282 2026-02-21T09:40:03.4210637Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:40:03.4210987Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:40:03.4211324Z 'l2_groupings': [1], 2026-02-21T09:40:03.4211580Z 'load_eviction_policies': ['', ''], 2026-02-21T09:40:03.4211874Z 'loop_orders': [[0, 1]], 2026-02-21T09:40:03.4212141Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:40:03.4212393Z 'num_stages': 2, 2026-02-21T09:40:03.4212613Z 'num_warps': 8, 2026-02-21T09:40:03.4212828Z 'pid_type': 'flat', 2026-02-21T09:40:03.4213076Z 'range_flattens': [None, False], 2026-02-21T09:40:03.4213366Z 'range_multi_buffers': [None, False], 2026-02-21T09:40:03.4213656Z 'range_num_stages': [0, 0], 2026-02-21T09:40:03.4214288Z 'range_unroll_factors': [0, 1], 2026-02-21T09:40:03.4214582Z 'range_warp_specializes': [], 2026-02-21T09:40:03.4214843Z 'waves_per_eu': 2} 2026-02-21T09:40:03.4898227Z [239s] Fitting surrogate: 695 points, 695 targets 2026-02-21T09:40:05.0373543Z [241s] Generation 7 starting: 55 neighbors, 3 active search path(s) 2026-02-21T09:40:15.7569388Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 3.7 configs/s 2026-02-21T09:40:19.1154771Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 17.4 configs/s 2026-02-21T09:40:23.1271952Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 231.2 2026-02-21T09:40:23.1272562Z configs/s 2026-02-21T09:40:23.6336524Z [259s] Generation 7 complete: 2026-02-21T09:40:23.6869679Z error=4 
2026-02-21T09:40:23.6869903Z ok=54 2026-02-21T09:40:23.6870112Z min=0.1020 2026-02-21T09:40:23.6870320Z mid=0.1978 2026-02-21T09:40:23.6870515Z max=7.7511 2026-02-21T09:40:23.6870785Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:40:23.6871164Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:40:23.6871538Z 'l2_groupings': [1], 2026-02-21T09:40:23.6871816Z 'load_eviction_policies': ['', ''], 2026-02-21T09:40:23.6872121Z 'loop_orders': [[0, 1]], 2026-02-21T09:40:23.6872642Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:40:23.6872906Z 'num_stages': 2, 2026-02-21T09:40:23.6873140Z 'num_warps': 8, 2026-02-21T09:40:23.6873368Z 'pid_type': 'flat', 2026-02-21T09:40:23.6873626Z 'range_flattens': [None, False], 2026-02-21T09:40:23.6873937Z 'range_multi_buffers': [None, False], 2026-02-21T09:40:23.6874249Z 'range_num_stages': [0, 0], 2026-02-21T09:40:23.6874534Z 'range_unroll_factors': [0, 1], 2026-02-21T09:40:23.6874833Z 'range_warp_specializes': [], 2026-02-21T09:40:23.6875106Z 'waves_per_eu': 2} 2026-02-21T09:40:23.6875390Z [259s] Fitting surrogate: 753 points, 753 targets 2026-02-21T09:40:24.2932538Z [260s] Generation 8 starting: 46 neighbors, 3 active search path(s) 2026-02-21T09:40:32.6712518Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 9.6 configs/s 2026-02-21T09:40:35.5866545Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 17.0 configs/s 2026-02-21T09:40:38.7600339Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 288.1 2026-02-21T09:40:38.7600946Z configs/s 2026-02-21T09:40:39.2439725Z [275s] Generation 8 complete: 2026-02-21T09:40:39.2440065Z error=3 2026-02-21T09:40:39.2440293Z ok=46 2026-02-21T09:40:39.2440501Z min=0.1019 2026-02-21T09:40:39.2440713Z mid=0.2245 2026-02-21T09:40:39.2440912Z max=6.0647 2026-02-21T09:40:39.2441140Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:40:39.2444036Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:40:39.2444484Z 'l2_groupings': [2], 
2026-02-21T09:40:39.2444771Z 'load_eviction_policies': ['', ''], 2026-02-21T09:40:39.2445091Z 'loop_orders': [[1, 0]], 2026-02-21T09:40:39.2445410Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:40:39.2445688Z 'num_stages': 3, 2026-02-21T09:40:39.2445943Z 'num_warps': 8, 2026-02-21T09:40:39.2446182Z 'pid_type': 'flat', 2026-02-21T09:40:39.2446446Z 'range_flattens': [None, True], 2026-02-21T09:40:39.2446769Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:39.2447088Z 'range_num_stages': [0, 3], 2026-02-21T09:40:39.2447375Z 'range_unroll_factors': [0, 1], 2026-02-21T09:40:39.2447679Z 'range_warp_specializes': [], 2026-02-21T09:40:39.2447967Z 'waves_per_eu': 4} 2026-02-21T09:40:39.2850521Z [275s] Fitting surrogate: 802 points, 802 targets 2026-02-21T09:40:39.7338037Z [275s] Generation 9 starting: 33 neighbors, 2 active search path(s) 2026-02-21T09:40:45.7168716Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 9.0 configs/s 2026-02-21T09:40:47.9534131Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 16.2 configs/s 2026-02-21T09:40:49.3855858Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 568.5 2026-02-21T09:40:49.3856436Z configs/s 2026-02-21T09:40:49.8218649Z [285s] Generation 9 complete: 2026-02-21T09:40:49.8219000Z error=1 2026-02-21T09:40:49.8219209Z ok=34 2026-02-21T09:40:49.8219435Z min=0.1019 2026-02-21T09:40:49.8219646Z mid=0.6240 2026-02-21T09:40:49.8219837Z max=4.6029 2026-02-21T09:40:49.8220064Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:40:49.8220429Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:40:49.8220780Z 'l2_groupings': [2], 2026-02-21T09:40:49.8221049Z 'load_eviction_policies': ['', ''], 2026-02-21T09:40:49.8221357Z 'loop_orders': [[1, 0]], 2026-02-21T09:40:49.8221641Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:40:49.8221907Z 'num_stages': 3, 2026-02-21T09:40:49.8222134Z 'num_warps': 8, 2026-02-21T09:40:49.8222370Z 'pid_type': 'flat', 2026-02-21T09:40:49.8222632Z 
'range_flattens': [None, True], 2026-02-21T09:40:49.8222946Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:49.8223254Z 'range_num_stages': [0, 3], 2026-02-21T09:40:49.8223531Z 'range_unroll_factors': [0, 1], 2026-02-21T09:40:49.8223791Z 'range_warp_specializes': [], 2026-02-21T09:40:49.8224016Z 'waves_per_eu': 4} 2026-02-21T09:40:49.8414581Z [285s] Fitting surrogate: 837 points, 837 targets 2026-02-21T09:40:50.3307219Z [286s] Generation 10 starting: 34 neighbors, 2 active search path(s) 2026-02-21T09:40:57.4232151Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 7.3 configs/s 2026-02-21T09:40:59.7069124Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 15.9 configs/s 2026-02-21T09:41:01.0895848Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 590.8 2026-02-21T09:41:01.0896065Z configs/s 2026-02-21T09:41:01.4663604Z [297s] Generation 10 complete: 2026-02-21T09:41:01.4663964Z ok=36 2026-02-21T09:41:01.4664178Z min=0.1020 2026-02-21T09:41:01.4664430Z mid=1.0527 2026-02-21T09:41:01.4664632Z max=4.3885 2026-02-21T09:41:01.4664856Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:41:01.4665563Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:41:01.4665911Z 'l2_groupings': [2], 2026-02-21T09:41:01.4666183Z 'load_eviction_policies': ['', ''], 2026-02-21T09:41:01.4666510Z 'loop_orders': [[1, 0]], 2026-02-21T09:41:01.4666780Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:41:01.4667045Z 'num_stages': 3, 2026-02-21T09:41:01.4667281Z 'num_warps': 8, 2026-02-21T09:41:01.4667507Z 'pid_type': 'flat', 2026-02-21T09:41:01.4667761Z 'range_flattens': [None, True], 2026-02-21T09:41:01.4668068Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:01.4668370Z 'range_num_stages': [0, 3], 2026-02-21T09:41:01.4668652Z 'range_unroll_factors': [0, 1], 2026-02-21T09:41:01.4668938Z 'range_warp_specializes': [], 2026-02-21T09:41:01.4669223Z 'waves_per_eu': 4} 2026-02-21T09:41:01.4845355Z [297s] Fitting surrogate: 873 points, 
873 targets 2026-02-21T09:41:01.8221879Z [297s] Generation 11 starting: 22 neighbors, 1 active search path(s) 2026-02-21T09:41:06.6282818Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 3.7 configs/s 2026-02-21T09:41:08.2147684Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 23/23 15.9 configs/s 2026-02-21T09:41:08.2153645Z [304s] Generation 11 complete: 2026-02-21T09:41:08.2153906Z ok=24 2026-02-21T09:41:08.2154078Z min=0.1020 2026-02-21T09:41:08.2154249Z mid=1.1884 2026-02-21T09:41:08.2154416Z max=4.5023 2026-02-21T09:41:08.2154600Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:41:08.2154901Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:41:08.2155189Z 'l2_groupings': [2], 2026-02-21T09:41:08.2155415Z 'load_eviction_policies': ['', ''], 2026-02-21T09:41:08.2155666Z 'loop_orders': [[1, 0]], 2026-02-21T09:41:08.2155896Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:41:08.2156117Z 'num_stages': 3, 2026-02-21T09:41:08.2156301Z 'num_warps': 8, 2026-02-21T09:41:08.2156503Z 'pid_type': 'flat', 2026-02-21T09:41:08.2157082Z 'range_flattens': [None, True], 2026-02-21T09:41:08.2157330Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:08.2157585Z 'range_num_stages': [0, 3], 2026-02-21T09:41:08.2157813Z 'range_unroll_factors': [0, 1], 2026-02-21T09:41:08.2158129Z 'range_warp_specializes': [], 2026-02-21T09:41:08.2158356Z 'waves_per_eu': 4} 2026-02-21T09:41:08.2186019Z [304s] Fitting surrogate: 897 points, 897 targets 2026-02-21T09:41:08.5209851Z [304s] Generation 12 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:41:12.6201249Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.5 configs/s 2026-02-21T09:41:14.0426640Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 14.8 configs/s 2026-02-21T09:41:14.0431234Z [310s] Generation 12 complete: 2026-02-21T09:41:14.0431536Z ok=20 2026-02-21T09:41:14.0431720Z min=0.1020 2026-02-21T09:41:14.0431905Z mid=1.2441 
2026-02-21T09:41:14.0432084Z max=14.1591 2026-02-21T09:41:14.0432313Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:41:14.0432641Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:41:14.0432967Z 'l2_groupings': [2], 2026-02-21T09:41:14.0433204Z 'load_eviction_policies': ['', ''], 2026-02-21T09:41:14.0433474Z 'loop_orders': [[1, 0]], 2026-02-21T09:41:14.0433724Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:41:14.0433961Z 'num_stages': 3, 2026-02-21T09:41:14.0434159Z 'num_warps': 8, 2026-02-21T09:41:14.0434362Z 'pid_type': 'flat', 2026-02-21T09:41:14.0434587Z 'range_flattens': [None, True], 2026-02-21T09:41:14.0434850Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:14.0435114Z 'range_num_stages': [0, 3], 2026-02-21T09:41:14.0435356Z 'range_unroll_factors': [0, 1], 2026-02-21T09:41:14.0435612Z 'range_warp_specializes': [], 2026-02-21T09:41:14.0435852Z 'waves_per_eu': 4} 2026-02-21T09:41:14.0463975Z [310s] Fitting surrogate: 917 points, 917 targets 2026-02-21T09:41:14.3503957Z [310s] Generation 13 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:41:18.7773174Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 3.0 configs/s 2026-02-21T09:41:20.2201798Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 15.3 configs/s 2026-02-21T09:41:20.2206781Z [316s] Generation 13 complete: 2026-02-21T09:41:20.2207139Z ok=21 2026-02-21T09:41:20.2207349Z min=0.1020 2026-02-21T09:41:20.2207565Z mid=1.3551 2026-02-21T09:41:20.2207764Z max=5.4281 2026-02-21T09:41:20.2207989Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:41:20.2208348Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:41:20.2208695Z 'l2_groupings': [2], 2026-02-21T09:41:20.2208967Z 'load_eviction_policies': ['', ''], 2026-02-21T09:41:20.2209279Z 'loop_orders': [[1, 0]], 2026-02-21T09:41:20.2209555Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:41:20.2209824Z 'num_stages': 3, 2026-02-21T09:41:20.2210048Z 'num_warps': 8, 2026-02-21T09:41:20.2210279Z 
'pid_type': 'flat', 2026-02-21T09:41:20.2210536Z 'range_flattens': [None, True], 2026-02-21T09:41:20.2210814Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:20.2211043Z 'range_num_stages': [0, 3], 2026-02-21T09:41:20.2211243Z 'range_unroll_factors': [0, 1], 2026-02-21T09:41:20.2211466Z 'range_warp_specializes': [], 2026-02-21T09:41:20.2211667Z 'waves_per_eu': 4} 2026-02-21T09:41:20.2240165Z [316s] Fitting surrogate: 938 points, 938 targets 2026-02-21T09:41:20.5299422Z [316s] Generation 14 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:41:23.3771512Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 7.1 configs/s 2026-02-21T09:41:24.7722968Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 15.9 configs/s 2026-02-21T09:41:24.7728431Z [320s] Generation 14 complete: 2026-02-21T09:41:24.7728781Z ok=21 2026-02-21T09:41:24.7729003Z min=0.1020 2026-02-21T09:41:24.7729216Z mid=1.0311 2026-02-21T09:41:24.7729420Z max=4.5522 2026-02-21T09:41:24.7729648Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:41:24.7730012Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:41:24.7730679Z 'l2_groupings': [2], 2026-02-21T09:41:24.7730914Z 'load_eviction_policies': ['', ''], 2026-02-21T09:41:24.7731196Z 'loop_orders': [[1, 0]], 2026-02-21T09:41:24.7731433Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:41:24.7731664Z 'num_stages': 3, 2026-02-21T09:41:24.7731952Z 'num_warps': 8, 2026-02-21T09:41:24.7732151Z 'pid_type': 'flat', 2026-02-21T09:41:24.7732372Z 'range_flattens': [None, True], 2026-02-21T09:41:24.7732635Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:24.7732897Z 'range_num_stages': [0, 3], 2026-02-21T09:41:24.7733132Z 'range_unroll_factors': [0, 1], 2026-02-21T09:41:24.7733386Z 'range_warp_specializes': [], 2026-02-21T09:41:24.7733622Z 'waves_per_eu': 4} 2026-02-21T09:41:24.7760912Z [320s] Fitting surrogate: 959 points, 959 targets 2026-02-21T09:41:25.0758397Z [321s] Generation 15 starting: 19 neighbors, 1 
active search path(s) 2026-02-21T09:41:56.4376421Z [352s] Timeout after 30s compiling Config(block_sizes=[128, 2, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:41:56.4398424Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.4 configs/s 2026-02-21T09:41:57.8901928Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 14.1 configs/s 2026-02-21T09:41:57.8913593Z [353s] Generation 15 complete: 2026-02-21T09:41:57.8916739Z timeout=1 2026-02-21T09:41:57.8917028Z ok=20 2026-02-21T09:41:57.8917233Z min=0.1020 2026-02-21T09:41:57.8917431Z mid=1.8204 2026-02-21T09:41:57.8917625Z max=12.7759 2026-02-21T09:41:57.8917846Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:41:57.8918232Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:41:57.8918560Z 'l2_groupings': [2], 2026-02-21T09:41:57.8918855Z 'load_eviction_policies': ['', ''], 2026-02-21T09:41:57.8919153Z 'loop_orders': [[1, 0]], 2026-02-21T09:41:57.8919418Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:41:57.8919697Z 'num_stages': 3, 2026-02-21T09:41:57.8919923Z 'num_warps': 8, 2026-02-21T09:41:57.8920147Z 'pid_type': 'flat', 2026-02-21T09:41:57.8920401Z 'range_flattens': [None, True], 2026-02-21T09:41:57.8920696Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:57.8920992Z 'range_num_stages': [0, 3], 2026-02-21T09:41:57.8921256Z 'range_unroll_factors': [0, 1], 2026-02-21T09:41:57.8921529Z 'range_warp_specializes': [], 2026-02-21T09:41:57.8921804Z 'waves_per_eu': 4} 2026-02-21T09:41:57.8942078Z [353s] Fitting surrogate: 980 points, 980 targets 2026-02-21T09:41:58.1979866Z [354s] Generation 16 starting: 19 neighbors, 1 active search path(s) 
2026-02-21T09:42:13.1923274Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.9 configs/s 2026-02-21T09:42:14.8820719Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 12.8 configs/s 2026-02-21T09:42:14.8823596Z [370s] Generation 16 complete: 2026-02-21T09:42:14.8823884Z ok=21 2026-02-21T09:42:14.8824063Z min=0.1020 2026-02-21T09:42:14.8824248Z mid=2.2351 2026-02-21T09:42:14.8824407Z max=12.5435 2026-02-21T09:42:14.8824595Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:42:14.8824886Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:42:14.8825172Z 'l2_groupings': [2], 2026-02-21T09:42:14.8825387Z 'load_eviction_policies': ['', ''], 2026-02-21T09:42:14.8825631Z 'loop_orders': [[1, 0]], 2026-02-21T09:42:14.8825845Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:42:14.8826052Z 'num_stages': 3, 2026-02-21T09:42:14.8826242Z 'num_warps': 8, 2026-02-21T09:42:14.8826416Z 'pid_type': 'flat', 2026-02-21T09:42:14.8826622Z 'range_flattens': [None, True], 2026-02-21T09:42:14.8826861Z 'range_multi_buffers': [None, None], 2026-02-21T09:42:14.8827101Z 'range_num_stages': [0, 3], 2026-02-21T09:42:14.8827322Z 'range_unroll_factors': [0, 1], 2026-02-21T09:42:14.8827551Z 'range_warp_specializes': [], 2026-02-21T09:42:14.8827777Z 'waves_per_eu': 4} 2026-02-21T09:42:14.8857462Z [370s] Fitting surrogate: 1001 points, 1001 targets 2026-02-21T09:42:15.1735515Z [371s] Generation 17 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:42:23.3247920Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 1.0 configs/s 2026-02-21T09:42:25.0272483Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 12.7 configs/s 2026-02-21T09:42:25.0277145Z [381s] Generation 17 complete: 2026-02-21T09:42:25.0277475Z ok=21 2026-02-21T09:42:25.0277714Z min=0.1020 2026-02-21T09:42:25.0277926Z mid=1.0149 2026-02-21T09:42:25.0278133Z max=32.8611 2026-02-21T09:42:25.0278370Z best={'block_sizes': [64, 64, 32], 
2026-02-21T09:42:25.0278749Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:42:25.0279102Z 'l2_groupings': [2], 2026-02-21T09:42:25.0279379Z 'load_eviction_policies': ['', ''], 2026-02-21T09:42:25.0279694Z 'loop_orders': [[1, 0]], 2026-02-21T09:42:25.0280002Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:42:25.0280280Z 'num_stages': 3, 2026-02-21T09:42:25.0280829Z 'num_warps': 8, 2026-02-21T09:42:25.0281063Z 'pid_type': 'flat', 2026-02-21T09:42:25.0281322Z 'range_flattens': [None, True], 2026-02-21T09:42:25.0281637Z 'range_multi_buffers': [None, None], 2026-02-21T09:42:25.0281942Z 'range_num_stages': [0, 3], 2026-02-21T09:42:25.0282223Z 'range_unroll_factors': [0, 1], 2026-02-21T09:42:25.0282521Z 'range_warp_specializes': [], 2026-02-21T09:42:25.0282882Z 'waves_per_eu': 4} 2026-02-21T09:42:25.0314424Z [381s] Fitting surrogate: 1022 points, 1022 targets 2026-02-21T09:42:25.3215097Z [381s] Generation 18 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:42:29.4404382Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 4.7 configs/s 2026-02-21T09:42:31.0335240Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 13.7 configs/s 2026-02-21T09:42:31.0339427Z [387s] Generation 18 complete: 2026-02-21T09:42:31.0339965Z ok=21 2026-02-21T09:42:31.0340256Z min=0.1020 2026-02-21T09:42:31.0340473Z mid=1.9155 2026-02-21T09:42:31.0340705Z max=10.0905 2026-02-21T09:42:31.0340945Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:42:31.0341320Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:42:31.0341689Z 'l2_groupings': [2], 2026-02-21T09:42:31.0341981Z 'load_eviction_policies': ['', ''], 2026-02-21T09:42:31.0342306Z 'loop_orders': [[1, 0]], 2026-02-21T09:42:31.0342591Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:42:31.0342870Z 'num_stages': 3, 2026-02-21T09:42:31.0343102Z 'num_warps': 8, 2026-02-21T09:42:31.0343344Z 'pid_type': 'flat', 2026-02-21T09:42:31.0343604Z 'range_flattens': [None, True], 
2026-02-21T09:42:31.0343871Z 'range_multi_buffers': [None, None], 2026-02-21T09:42:31.0343997Z 'range_num_stages': [0, 3], 2026-02-21T09:42:31.0344114Z 'range_unroll_factors': [0, 1], 2026-02-21T09:42:31.0344232Z 'range_warp_specializes': [], 2026-02-21T09:42:31.0344347Z 'waves_per_eu': 4} 2026-02-21T09:42:31.0372960Z [387s] Fitting surrogate: 1043 points, 1043 targets 2026-02-21T09:42:31.3370005Z [387s] Generation 19 starting: 21 neighbors, 1 active search path(s) 2026-02-21T09:43:03.2891508Z [419s] Timeout after 30s compiling Config(block_sizes=[64, 4, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:43:03.2911816Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 0.2 configs/s 2026-02-21T09:43:05.3938835Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 11.1 configs/s 2026-02-21T09:43:05.3942900Z [421s] Generation 19 complete: 2026-02-21T09:43:05.3945849Z timeout=1 2026-02-21T09:43:05.3946388Z ok=22 2026-02-21T09:43:05.3946679Z min=0.1020 2026-02-21T09:43:05.3946947Z mid=3.7060 2026-02-21T09:43:05.3949870Z max=21.7618 2026-02-21T09:43:05.3950105Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:43:05.3950488Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:43:05.3950813Z 'l2_groupings': [2], 2026-02-21T09:43:05.3951378Z 'load_eviction_policies': ['', ''], 2026-02-21T09:43:05.3951668Z 'loop_orders': [[1, 0]], 2026-02-21T09:43:05.3951924Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:43:05.3952169Z 'num_stages': 3, 2026-02-21T09:43:05.3952384Z 'num_warps': 8, 2026-02-21T09:43:05.3952601Z 'pid_type': 'flat', 2026-02-21T09:43:05.3952832Z 'range_flattens': [None, True], 2026-02-21T09:43:05.3953116Z 
'range_multi_buffers': [None, None], 2026-02-21T09:43:05.3953397Z 'range_num_stages': [0, 3], 2026-02-21T09:43:05.3953647Z 'range_unroll_factors': [0, 1], 2026-02-21T09:43:05.3953921Z 'range_warp_specializes': [], 2026-02-21T09:43:05.3954172Z 'waves_per_eu': 4} 2026-02-21T09:43:05.3976606Z [421s] Fitting surrogate: 1066 points, 1066 targets 2026-02-21T09:43:06.4657220Z [422s] Generation 20 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:43:09.8016655Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 6.6 configs/s 2026-02-21T09:43:10.5429606Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:43:10.5432472Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 4], order = [2, 1, 0]}> 2026-02-21T09:43:10.5433246Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 4], order = [1, 0]}> 2026-02-21T09:43:10.5433950Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:43:10.5434607Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 16], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:43:10.5435204Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:43:10.5435779Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:43:10.5436200Z #smem = #ttg.shared_memory 2026-02-21T09:43:10.5436679Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:43:10.5437590Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:43:10.5438429Z %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma> 2026-02-21T09:43:10.5438759Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:43:10.5439087Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:43:10.5439722Z %cst_2 = arith.constant dense<7168> : tensor<4x1xi32, #blocked1> 2026-02-21T09:43:10.5440059Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2> 2026-02-21T09:43:10.5440417Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:43:10.5440717Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:43:10.5440947Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:43:10.5441168Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:43:10.5441377Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:43:10.5441645Z %cst_5 = arith.constant dense<0> : tensor<4x2x256xi8, #blocked> 2026-02-21T09:43:10.5441933Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:43:10.5442153Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:43:10.5442364Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:43:10.5442676Z %c28_i32 = arith.constant 28 : i32 2026-02-21T09:43:10.5443025Z %cst_6 = arith.constant dense<4> : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:43:10.5443384Z %0 = tt.get_program_id x : i32 2026-02-21T09:43:10.5443590Z %1 = arith.divsi %0, %c8_i32 : i32 2026-02-21T09:43:10.5443805Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:43:10.5444017Z %3 = arith.subi %c28_i32, %2 : i32 2026-02-21T09:43:10.5444299Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:43:10.5444504Z %5 = arith.remsi %0, %c8_i32 : i32 2026-02-21T09:43:10.5444711Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:43:10.5444917Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:43:10.5445114Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:43:10.5445320Z %9 = arith.muli %7, %c256_i32 : i32 2026-02-21T09:43:10.5445687Z 
%10 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:43:10.5446223Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:43:10.5446660Z %12 = tt.splat %9 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:43:10.5446956Z %13 = tt.splat %9 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:43:10.5447281Z %14 = arith.addi %12, %10 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:43:10.5447576Z %15 = arith.addi %13, %11 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:43:10.5447807Z %16 = arith.muli %8, %c16_i32 : i32 2026-02-21T09:43:10.5448080Z %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:43:10.5448448Z %18 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:43:10.5448770Z %19 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:43:10.5449062Z %20 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:43:10.5449351Z %21 = arith.addi %19, %17 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:43:10.5449634Z %22 = arith.addi %20, %18 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:43:10.5449962Z %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:43:10.5450339Z %24 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:43:10.5450749Z %25 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:43:10.5451090Z %26 = arith.muli %25, 
%cst_3 : tensor<16x1xi32, #blocked2> 2026-02-21T09:43:10.5451353Z %27 = tt.broadcast %26 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:43:10.5451637Z %28 = tt.splat %arg0 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:43:10.5452061Z %29 = tt.expand_dims %15 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:43:10.5452445Z %30 = tt.broadcast %29 : tensor<1x256xi32, #blocked1> -> tensor<4x256xi32, #blocked1> 2026-02-21T09:43:10.5452733Z %31 = tt.splat %arg1 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked1> 2026-02-21T09:43:10.5453108Z %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:43:10.5453670Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:43:10.5454231Z %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:43:10.5454572Z %35 = arith.cmpi eq, %34, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:43:10.5454845Z %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x256xi1, #blocked> 2026-02-21T09:43:10.5455114Z %37 = arith.cmpi eq, %34, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:43:10.5455371Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x256xi1, #blocked> 2026-02-21T09:43:10.5455760Z %39 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg4 = %cst_4) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:43:10.5456127Z %49 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:43:10.5456415Z %50 = arith.addi %49, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:43:10.5456604Z %51 
= arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:43:10.5456781Z %52 = tt.splat %51 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:43:10.5457016Z %53 = arith.addi %52, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:43:10.5457311Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:43:10.5457623Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:43:10.5457830Z %56 = arith.addi %27, %55 : tensor<16x8xi32, #blocked2> 2026-02-21T09:43:10.5458037Z %57 = tt.addptr %28, %56 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:43:10.5458257Z %58 = tt.load %57 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:43:10.5458494Z %59 = ttg.local_alloc %58 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem> 2026-02-21T09:43:10.5458844Z %60 = ttg.local_load %59 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:43:10.5459279Z %61 = arith.extf %60 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:43:10.5459685Z %62 = tt.expand_dims %50 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:43:10.5459950Z %63 = arith.muli %62, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:43:10.5460158Z %64 = tt.broadcast %63 : tensor<4x1xi32, #blocked1> -> tensor<4x256xi32, #blocked1> 2026-02-21T09:43:10.5460362Z %65 = arith.addi %64, %30 : tensor<4x256xi32, #blocked1> 2026-02-21T09:43:10.5460578Z %66 = tt.addptr %31, %65 : tensor<4x256x!tt.ptr, #blocked1>, tensor<4x256xi32, #blocked1> 2026-02-21T09:43:10.5460788Z %67 = tt.load %66 : tensor<4x256x!tt.ptr, #blocked1> 2026-02-21T09:43:10.5461048Z %68 = 
ttg.convert_layout %67 : tensor<4x256xi8, #blocked1> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:43:10.5461382Z %69 = arith.shli %68, %cst_6 : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:43:10.5461632Z %70 = arith.shrsi %69, %cst_6 : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:43:10.5461882Z %71 = arith.shrsi %68, %cst_6 : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:43:10.5462192Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x256xi8, #blocked> 2026-02-21T09:43:10.5462556Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x256xi8, #blocked> 2026-02-21T09:43:10.5462866Z %74 = tt.broadcast %72 : tensor<4x1x256xi8, #blocked> -> tensor<4x2x256xi8, #blocked> 2026-02-21T09:43:10.5463124Z %75 = arith.select %36, %74, %cst_5 : tensor<4x2x256xi1, #blocked>, tensor<4x2x256xi8, #blocked> 2026-02-21T09:43:10.5463378Z %76 = tt.broadcast %73 : tensor<4x1x256xi8, #blocked> -> tensor<4x2x256xi8, #blocked> 2026-02-21T09:43:10.5463630Z %77 = arith.select %38, %76, %75 : tensor<4x2x256xi1, #blocked>, tensor<4x2x256xi8, #blocked> 2026-02-21T09:43:10.5463879Z %78 = tt.reshape %77 : tensor<4x2x256xi8, #blocked> -> tensor<8x256xi8, #blocked1> 2026-02-21T09:43:10.5464119Z %79 = arith.sitofp %78 : tensor<8x256xi8, #blocked1> to tensor<8x256xf32, #blocked1> 2026-02-21T09:43:10.5464401Z %80 = ttg.local_alloc %79 : (tensor<8x256xf32, #blocked1>) -> !ttg.memdesc<8x256xf32, #shared1, #smem> 2026-02-21T09:43:10.5464750Z %81 = ttg.local_load %80 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:43:10.5465261Z %82 = tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:43:10.5465636Z scf.yield %82 : tensor<16x256xf32, #mma> 2026-02-21T09:43:10.5465849Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32} 2026-02-21T09:43:10.5466118Z %40 = arith.truncf %39 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:43:10.5466396Z %41 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:43:10.5466648Z %42 = arith.muli %41, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:43:10.5466885Z %43 = tt.expand_dims %14 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:43:10.5467139Z %44 = tt.broadcast %42 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:43:10.5467335Z %45 = tt.broadcast %43 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:43:10.5467509Z %46 = arith.addi %44, %45 : tensor<16x256xi32, #mma> 2026-02-21T09:43:10.5467678Z %47 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:43:10.5467888Z %48 = tt.addptr %47, %46 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:43:10.5468078Z tt.store %48, %40 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:43:10.5468204Z tt.return 2026-02-21T09:43:10.5468288Z } 2026-02-21T09:43:10.5468362Z } 2026-02-21T09:43:10.5468407Z 2026-02-21T09:43:10.5468437Z {-# 2026-02-21T09:43:10.5468518Z external_resources: { 2026-02-21T09:43:10.5468615Z mlir_reproducer: { 2026-02-21T09:43:10.5469655Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, 
convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:43:10.5470639Z disable_threading: false, 2026-02-21T09:43:10.5470744Z verify_each: true 2026-02-21T09:43:10.5470833Z } 2026-02-21T09:43:10.5470905Z } 2026-02-21T09:43:10.5470974Z #-} 2026-02-21T09:43:10.5471245Z /tmp/torchinductor_root/ys/cysxx4khj5nqdldlywplcst5qbz6wuczzqt4j7dvs6oooswg7jlh.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:43:10.5471922Z /tmp/torchinductor_root/ys/cysxx4khj5nqdldlywplcst5qbz6wuczzqt4j7dvs6oooswg7jlh.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:43:10.5472471Z [426s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:43:10.5473194Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:43:10.5473864Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:43:10.5474033Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:43:10.7915235Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 21.6 configs/s 2026-02-21T09:43:10.7918863Z [426s] Generation 20 complete: 2026-02-21T09:43:10.7919204Z error=5 2026-02-21T09:43:10.7919410Z ok=15 2026-02-21T09:43:10.7919610Z min=0.1020 2026-02-21T09:43:10.7919818Z mid=0.4212 2026-02-21T09:43:10.7920018Z max=2.3786 2026-02-21T09:43:10.7920245Z best={'block_sizes': [64, 64, 32], 2026-02-21T09:43:10.7920630Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:43:10.7920985Z 'l2_groupings': [2], 2026-02-21T09:43:10.7921455Z 'load_eviction_policies': ['', ''], 2026-02-21T09:43:10.7921764Z 'loop_orders': [[1, 0]], 2026-02-21T09:43:10.7922049Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:43:10.7922326Z 'num_stages': 3, 2026-02-21T09:43:10.7922559Z 'num_warps': 8, 2026-02-21T09:43:10.7922895Z 'pid_type': 'flat', 2026-02-21T09:43:10.7923161Z 'range_flattens': [None, True], 2026-02-21T09:43:10.7923461Z 'range_multi_buffers': [None, None], 2026-02-21T09:43:10.7923770Z 'range_num_stages': [0, 3], 2026-02-21T09:43:10.7924050Z 'range_unroll_factors': [0, 1], 2026-02-21T09:43:10.7924345Z 'range_warp_specializes': [], 2026-02-21T09:43:10.7924630Z 'waves_per_eu': 4} 2026-02-21T09:43:10.7951464Z [426s] Fitting surrogate: 1086 points, 1086 targets 2026-02-21T09:43:10.9203210Z [427s] Autotuning complete in 427.0s after searching 1045 configs. 
2026-02-21T09:43:10.9203798Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:43:10.9205649Z @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:43:10.9207107Z 2026-02-21T09:43:10.9207424Z [427s] Code of selected kernel: /tmp/torchinductor_root/f4/cf4x2z7r3h3z2xnybclxecodxrnqfexakhkt2hd4zmrekctdjdps.py 2026-02-21T09:43:10.9363182Z from __future__ import annotations 2026-02-21T09:43:10.9363440Z 2026-02-21T09:43:10.9363788Z import torch 2026-02-21T09:43:10.9364076Z import triton 2026-02-21T09:43:10.9364335Z import triton.language as tl 2026-02-21T09:43:10.9364757Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:43:10.9365090Z 2026-02-21T09:43:10.9365589Z _BLOCK_SIZE_2 = tl.constexpr(32) 2026-02-21T09:43:10.9365887Z _BLOCK_SIZE_1 = tl.constexpr(64) 2026-02-21T09:43:10.9366175Z _BLOCK_SIZE_0 = tl.constexpr(64) 2026-02-21T09:43:10.9366354Z 2026-02-21T09:43:10.9366418Z @triton.jit 2026-02-21T09:43:10.9366673Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:43:10.9367025Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:43:10.9367259Z num_pid_m = tl.cdiv(7168, _BLOCK_SIZE_2) 2026-02-21T09:43:10.9367443Z num_pid_n = tl.cdiv(64, _BLOCK_SIZE_1) 2026-02-21T09:43:10.9367631Z inner_2d_pid = tl.program_id(0) 2026-02-21T09:43:10.9367798Z num_pid_in_group = 2 * num_pid_n 2026-02-21T09:43:10.9367996Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:43:10.9368185Z first_pid_m = group_id * 2 2026-02-21T09:43:10.9368366Z group_size_m = min(num_pid_m - 
first_pid_m, 2) 2026-02-21T09:43:10.9368627Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:43:10.9368906Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T09:43:10.9369117Z offset_2 = pid_0 * _BLOCK_SIZE_2 2026-02-21T09:43:10.9369334Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:43:10.9369600Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T09:43:10.9369812Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:43:10.9370104Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:43:10.9370400Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:43:10.9370712Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:43:10.9371107Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:43:10.9371492Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:43:10.9371779Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:43:10.9372105Z for offset_3 in tl.range(0, 4096, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=3, flatten=True): 2026-02-21T09:43:10.9372453Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:43:10.9372664Z acc_copy = acc 2026-02-21T09:43:10.9372814Z acc_copy_0 = acc_copy 2026-02-21T09:43:10.9373018Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:43:10.9373232Z mul = 2 * offset_3 2026-02-21T09:43:10.9373477Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:43:10.9373746Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:43:10.9374132Z load = tl.load(A + (indices_1[:, None] * 8192 + iota[None, :] * 1), None) 2026-02-21T09:43:10.9374510Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:43:10.9374942Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:43:10.9375198Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:43:10.9375466Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:43:10.9375775Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:43:10.9383369Z b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None) 2026-02-21T09:43:10.9383640Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:43:10.9383855Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:43:10.9383998Z v_2 = b_tile << v_1 2026-02-21T09:43:10.9384120Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:43:10.9384245Z v_4 = v_2 >> v_3 2026-02-21T09:43:10.9384418Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:43:10.9384691Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:43:10.9384813Z v_6 = b_tile >> v_5 2026-02-21T09:43:10.9384975Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:43:10.9385157Z 
stack_idx = tl.arange(0, 2) 2026-02-21T09:43:10.9385304Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:43:10.9385455Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:43:10.9385593Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:43:10.9385745Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:43:10.9385894Z mask_0 = broadcast_idx == 0 2026-02-21T09:43:10.9386070Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:43:10.9386242Z mask_1 = broadcast_idx == 1 2026-02-21T09:43:10.9386399Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:43:10.9386572Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:43:10.9386759Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:43:10.9386930Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:43:10.9387088Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:43:10.9387258Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:43:10.9387427Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:43:10.9387602Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:43:10.9387739Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:43:10.9387888Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:43:10.9388072Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:43:10.9388260Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:43:10.9388383Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:43:10.9388513Z acc = acc_copy_0 + sum_1 2026-02-21T09:43:10.9388654Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:43:10.9388820Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:43:10.9388975Z tl.store(C + (indices_1[:, None] * 7168 + indices_2[None, :] * 1), v_10, None) 2026-02-21T09:43:10.9389099Z 2026-02-21T09:43:10.9389185Z def matmul_bf16_int4(A: 
Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:43:10.9389339Z """ 2026-02-21T09:43:10.9389448Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:43:10.9389548Z 2026-02-21T09:43:10.9389609Z This kernel performs matrix multiplication where: 2026-02-21T09:43:10.9389756Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:43:10.9389915Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:43:10.9390081Z (two 4-bit values packed into each int8) 2026-02-21T09:43:10.9390162Z 2026-02-21T09:43:10.9390198Z Args: 2026-02-21T09:43:10.9390314Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:43:10.9390492Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:43:10.9390607Z 2026-02-21T09:43:10.9390640Z Returns: 2026-02-21T09:43:10.9390755Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 2026-02-21T09:43:10.9390888Z """ 2026-02-21T09:43:10.9390975Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:43:10.9391087Z M, K = A.shape 2026-02-21T09:43:10.9391181Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:43:10.9391288Z _, N = B.shape 2026-02-21T09:43:10.9391429Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:43:10.9391629Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:43:10.9391800Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:43:10.9391941Z _BLOCK_SIZE_2 = 32 2026-02-21T09:43:10.9392036Z _BLOCK_SIZE_1 = 64 2026-02-21T09:43:10.9392229Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:43:10.9392494Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:43:10.9392746Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:43:10.9392921Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:43:10.9393024Z _BLOCK_SIZE_0 = 64 2026-02-21T09:43:10.9393144Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:43:10.9393321Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:43:10.9393486Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:43:10.9393612Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:43:10.9393751Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:43:10.9393942Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:43:10.9394104Z # src[int4_gemm.py:57-91]: ... 2026-02-21T09:43:10.9394240Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:43:10.9394607Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(7168, _BLOCK_SIZE_2) * triton.cdiv(64, _BLOCK_SIZE_1),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=8, num_stages=3, waves_per_eu=4, matrix_instr_nonkdim=16) 2026-02-21T09:43:10.9394957Z # src[int4_gemm.py:93]: return C 2026-02-21T09:43:10.9395065Z return C 2026-02-21T09:43:11.9568276Z WARNING:tritonbench.utils.triton_op:Completed input ID 14: 2026-02-21T09:43:11.9568716Z x_val 2026-02-21T09:43:11.9568932Z ------------------- 2026-02-21T09:43:11.9569171Z (64, 1, 7168, 8192) 2026-02-21T09:43:11.9569311Z 2026-02-21T09:43:11.9581722Z 50%|█████ | 5/10 [40:33<41:27, 497.48s/it]WARNING:tritonbench.utils.triton_op:Running input ID 17: 2026-02-21T09:43:11.9582134Z x_val 2026-02-21T09:43:11.9582309Z --------------------- 2026-02-21T09:43:11.9582526Z (1, 4096, 8192, 1024) 2026-02-21T09:43:11.9584267Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:43:13.1274108Z INFO:tritonbench.utils.triton_op:Took 3.74ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:43:15.3164920Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_triton_int4_gemm 
2026-02-21T09:43:15.3183015Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:43:15.3183314Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:43:15.3183568Z 'dtype': 'torch.bfloat16', 2026-02-21T09:43:15.3183813Z 'shape': (1, 4096, 1024), 2026-02-21T09:43:15.3184056Z 'stride': (4194304, 1024, 1)}, 2026-02-21T09:43:15.3184293Z { 'device': 'cuda:0', 2026-02-21T09:43:15.3184519Z 'dtype': 'torch.int32', 2026-02-21T09:43:15.3184745Z 'shape': (1024, 8192), 2026-02-21T09:43:15.3184992Z 'stride': (8192, 1)}), 2026-02-21T09:43:15.3185208Z 'kwargs': {}} 2026-02-21T09:43:15.3221240Z INFO:tritonbench.utils.triton_op:Took 3.97ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T09:43:15.5115292Z [0s] Autotune random seed: 2138032649 2026-02-21T09:43:15.5335758Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:43:48.9946486Z [33s] Timeout after 30s compiling Config(block_sizes=[64, 256, 2], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T09:44:00.1565328Z [44s] Timeout after 30s compiling Config(block_sizes=[8, 4096, 4], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:44:00.1587259Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s 2026-02-21T09:44:04.4013718Z python: 
/root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:44:04.4017443Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:44:04.4019243Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:44:04.4019624Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:44:04.4019942Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:44:04.4020461Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:44:04.4020717Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:44:04.4020897Z #smem = #ttg.shared_memory 2026-02-21T09:44:04.4021129Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:44:04.4024750Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:44:04.4025149Z %cst = arith.constant dense<8192> : tensor<1x512xi64, #mma> 2026-02-21T09:44:04.4025324Z %cst_0 = arith.constant dense<0> : tensor<1x512xi64, #mma> 2026-02-21T09:44:04.4025580Z %cst_1 = arith.constant dense<4096> : tensor<32x1xi64, #mma> 2026-02-21T09:44:04.4025743Z %cst_2 = arith.constant dense<0> : tensor<32x1xi64, #mma> 2026-02-21T09:44:04.4025904Z %cst_3 = arith.constant dense<8192> : tensor<32x1xi64, #mma> 2026-02-21T09:44:04.4026073Z %cst_4 = arith.constant dense<1> : 
tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:04.4026246Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:04.4026424Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<32x512xf32, #mma> 2026-02-21T09:44:04.4026610Z %cst_7 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:04.4026759Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:44:04.4026879Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:44:04.4027018Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:44:04.4027134Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:44:04.4027249Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:44:04.4027366Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:44:04.4027473Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:44:04.4027590Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:44:04.4027737Z %cst_8 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4027888Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:44:04.4028000Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:44:04.4028110Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:44:04.4028288Z %cst_9 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4028477Z %0 = tt.get_program_id x : i32 2026-02-21T09:44:04.4028591Z %1 = arith.divsi %0, %c1024_i32 : i32 2026-02-21T09:44:04.4028705Z %2 = arith.muli %1, %c8_i32 : i32 2026-02-21T09:44:04.4028817Z %3 = arith.subi %c16_i32, %2 : i32 2026-02-21T09:44:04.4029011Z %4 = arith.minsi %3, %c8_i32 : i32 2026-02-21T09:44:04.4029123Z %5 = arith.remsi %0, %c1024_i32 : i32 2026-02-21T09:44:04.4029239Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:44:04.4029348Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:44:04.4029452Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:44:04.4029585Z %9 = arith.muli %7, %c512_i32 : i32 2026-02-21T09:44:04.4029790Z %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = 
#blocked2}>> 2026-02-21T09:44:04.4030070Z %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:04.4030320Z %12 = tt.splat %9 : i32 -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:44:04.4030541Z %13 = arith.addi %12, %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:44:04.4030714Z %14 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:44:04.4030912Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:04.4031179Z %16 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:04.4031435Z %17 = tt.splat %14 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:04.4031664Z %18 = arith.addi %17, %15 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:04.4031905Z %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4032212Z %20 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:44:04.4032461Z %21 = arith.muli %20, %cst_7 : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:04.4032714Z %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4032931Z %23 = tt.splat %arg0 : !tt.ptr -> tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:04.4033217Z %24 = tt.expand_dims %13 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4033482Z %25 = tt.splat %arg1 : !tt.ptr -> tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:44:04.4033775Z %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 
2026-02-21T09:44:04.4034184Z %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:44:04.4034581Z %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:04.4034832Z %29 = arith.cmpi eq, %28, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:04.4035026Z %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T09:44:04.4035224Z %31 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:04.4035411Z %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T09:44:04.4035675Z %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_6) -> (tensor<32x512xf32, #mma>) : i32 { 2026-02-21T09:44:04.4035889Z %60 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:44:04.4036061Z %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4036275Z %62 = arith.addi %61, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4036545Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:04.4036832Z %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4037019Z %65 = arith.addi %22, %64 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4037216Z %66 = tt.addptr %23, %65 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4037420Z %67 = tt.load %66 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:04.4037680Z %68 = ttg.convert_layout %67 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4038075Z %69 = 
arith.extf %68 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4038355Z %70 = arith.muli %arg3, %c8192_i32 : i32 2026-02-21T09:44:04.4038497Z %71 = tt.splat %70 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4038651Z %72 = arith.addi %71, %24 : tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4038841Z %73 = tt.addptr %25, %72 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4039040Z %74 = tt.load %73 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:44:04.4039276Z %75 = ttg.convert_layout %74 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4039568Z %76 = arith.shli %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4039798Z %77 = arith.shrsi %76, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4040025Z %78 = arith.shrsi %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4040308Z %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4040666Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4040948Z %81 = tt.broadcast %79 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4041206Z %82 = arith.select %30, %81, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4041439Z %83 = tt.broadcast %80 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4041669Z %84 = arith.select %32, %83, %82 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4041893Z %85 = tt.reshape %84 : tensor<1x2x512xi8, 
#blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:44:04.4042114Z %86 = arith.sitofp %85 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3> 2026-02-21T09:44:04.4042363Z %87 = ttg.local_alloc %86 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:44:04.4042754Z %88 = ttg.local_load %87 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4043224Z %89 = tt.dot %69, %88, %arg4, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:44:04.4043567Z %90 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:44:04.4043690Z %91 = arith.muli %90, %c2_i32 : i32 2026-02-21T09:44:04.4043857Z %92 = tt.splat %91 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4044072Z %93 = arith.addi %92, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4044342Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:04.4044632Z %95 = tt.broadcast %94 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4044820Z %96 = arith.addi %22, %95 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4045016Z %97 = tt.addptr %23, %96 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4045213Z %98 = tt.load %97 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:04.4045470Z %99 = ttg.convert_layout %98 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4045867Z %100 = arith.extf %99 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:44:04.4046153Z %101 = arith.muli %90, %c8192_i32 : i32 2026-02-21T09:44:04.4046296Z %102 = tt.splat %101 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4046453Z %103 = arith.addi %102, %24 : tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4046656Z %104 = tt.addptr %25, %103 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4046856Z %105 = tt.load %104 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:44:04.4047103Z %106 = ttg.convert_layout %105 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4047407Z %107 = arith.shli %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4047641Z %108 = arith.shrsi %107, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4047878Z %109 = arith.shrsi %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4048165Z %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4048527Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4048817Z %112 = tt.broadcast %110 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4049088Z %113 = arith.select %30, %112, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4049331Z %114 = tt.broadcast %111 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4049565Z %115 = arith.select %32, %114, %113 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4049797Z %116 = tt.reshape %115 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:44:04.4050022Z %117 = arith.sitofp %116 : tensor<2x512xi8, #blocked3> to 
tensor<2x512xf32, #blocked3> 2026-02-21T09:44:04.4050277Z %118 = ttg.local_alloc %117 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:44:04.4050604Z %119 = ttg.local_load %118 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4051073Z %120 = tt.dot %100, %119, %89, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:44:04.4051422Z %121 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:44:04.4051544Z %122 = arith.muli %121, %c2_i32 : i32 2026-02-21T09:44:04.4051711Z %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4051935Z %124 = arith.addi %123, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4052211Z %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:04.4052490Z %126 = tt.broadcast %125 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4052705Z %127 = arith.addi %22, %126 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4052904Z %128 = tt.addptr %23, %127 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4053110Z %129 = tt.load %128 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:04.4053370Z %130 = ttg.convert_layout %129 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4053766Z %131 = arith.extf %130 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4054049Z %132 = arith.muli %121, %c8192_i32 : i32 2026-02-21T09:44:04.4054189Z %133 = tt.splat 
%132 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4054346Z %134 = arith.addi %133, %24 : tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4054543Z %135 = tt.addptr %25, %134 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4054747Z %136 = tt.load %135 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:44:04.4054992Z %137 = ttg.convert_layout %136 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4055290Z %138 = arith.shli %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4055527Z %139 = arith.shrsi %138, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4055761Z %140 = arith.shrsi %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4056051Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4056407Z %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4056695Z %143 = tt.broadcast %141 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4056964Z %144 = arith.select %30, %143, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4057201Z %145 = tt.broadcast %142 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4057435Z %146 = arith.select %32, %145, %144 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4057667Z %147 = tt.reshape %146 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:44:04.4057894Z %148 = arith.sitofp %147 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3> 2026-02-21T09:44:04.4058155Z %149 = ttg.local_alloc %148 : (tensor<2x512xf32, #blocked3>) -> 
!ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:44:04.4058485Z %150 = ttg.local_load %149 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4058967Z %151 = tt.dot %131, %150, %120, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:44:04.4059318Z %152 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:44:04.4059443Z %153 = arith.muli %152, %c2_i32 : i32 2026-02-21T09:44:04.4059610Z %154 = tt.splat %153 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4059838Z %155 = arith.addi %154, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:04.4060114Z %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:04.4060392Z %157 = tt.broadcast %156 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4060599Z %158 = arith.addi %22, %157 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4060797Z %159 = tt.addptr %23, %158 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:04.4061002Z %160 = tt.load %159 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:04.4061262Z %161 = ttg.convert_layout %160 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4061659Z %162 = arith.extf %161 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4061939Z %163 = arith.muli %152, %c8192_i32 : i32 2026-02-21T09:44:04.4062078Z %164 = tt.splat %163 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4062234Z %165 = arith.addi %164, %24 : tensor<1x512xi32, 
#blocked2> 2026-02-21T09:44:04.4062430Z %166 = tt.addptr %25, %165 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:44:04.4062634Z %167 = tt.load %166 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:44:04.4062876Z %168 = ttg.convert_layout %167 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4063172Z %169 = arith.shli %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4063408Z %170 = arith.shrsi %169, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4063643Z %171 = arith.shrsi %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:04.4063931Z %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4064288Z %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:44:04.4064573Z %174 = tt.broadcast %172 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4064833Z %175 = arith.select %30, %174, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4065089Z %176 = tt.broadcast %173 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4065327Z %177 = arith.select %32, %176, %175 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:44:04.4065560Z %178 = tt.reshape %177 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:44:04.4065785Z %179 = arith.sitofp %178 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3> 2026-02-21T09:44:04.4066043Z %180 = ttg.local_alloc %179 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:44:04.4066371Z %181 = ttg.local_load %180 : !ttg.memdesc<2x512xf32, 
#shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:04.4066840Z %182 = tt.dot %162, %181, %151, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:44:04.4067193Z scf.yield %182 : tensor<32x512xf32, #mma> 2026-02-21T09:44:04.4067319Z } {tt.num_stages = 1 : i32} 2026-02-21T09:44:04.4067473Z %34 = arith.truncf %33 : tensor<32x512xf32, #mma> to tensor<32x512xbf16, #mma> 2026-02-21T09:44:04.4067637Z %35 = arith.extsi %14 : i32 to i64 2026-02-21T09:44:04.4067756Z %36 = arith.extsi %9 : i32 to i64 2026-02-21T09:44:04.4067907Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<32x512x!tt.ptr, #mma> 2026-02-21T09:44:04.4068109Z %38 = tt.splat %35 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:04.4068400Z %39 = arith.extsi %16 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:04.4068667Z %40 = arith.addi %38, %39 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:04.4068925Z %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:44:04.4069157Z %42 = arith.muli %41, %cst_3 : tensor<32x1xi64, #mma> 2026-02-21T09:44:04.4069330Z %43 = tt.broadcast %42 : tensor<32x1xi64, #mma> -> tensor<32x512xi64, #mma> 2026-02-21T09:44:04.4069533Z %44 = tt.splat %36 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:04.4069806Z %45 = arith.extsi %11 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:04.4070075Z %46 = arith.addi %44, %45 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:04.4070334Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<512xi64, 
#ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:44:04.4070589Z %48 = tt.broadcast %47 : tensor<1x512xi64, #mma> -> tensor<32x512xi64, #mma> 2026-02-21T09:44:04.4070767Z %49 = arith.addi %43, %48 : tensor<32x512xi64, #mma> 2026-02-21T09:44:04.4070947Z %50 = tt.addptr %37, %49 : tensor<32x512x!tt.ptr, #mma>, tensor<32x512xi64, #mma> 2026-02-21T09:44:04.4092106Z %51 = arith.cmpi sge, %41, %cst_2 : tensor<32x1xi64, #mma> 2026-02-21T09:44:04.4092265Z %52 = arith.cmpi slt, %41, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T09:44:04.4092420Z %53 = arith.andi %51, %52 : tensor<32x1xi1, #mma> 2026-02-21T09:44:04.4092587Z %54 = tt.broadcast %53 : tensor<32x1xi1, #mma> -> tensor<32x512xi1, #mma> 2026-02-21T09:44:04.4092766Z %55 = arith.cmpi sge, %47, %cst_0 : tensor<1x512xi64, #mma> 2026-02-21T09:44:04.4092928Z %56 = arith.cmpi slt, %47, %cst : tensor<1x512xi64, #mma> 2026-02-21T09:44:04.4093107Z %57 = arith.andi %55, %56 : tensor<1x512xi1, #mma> 2026-02-21T09:44:04.4093277Z %58 = tt.broadcast %57 : tensor<1x512xi1, #mma> -> tensor<32x512xi1, #mma> 2026-02-21T09:44:04.4093447Z %59 = arith.andi %54, %58 : tensor<32x512xi1, #mma> 2026-02-21T09:44:04.4093621Z tt.store %50, %34, %59 : tensor<32x512x!tt.ptr, #mma> 2026-02-21T09:44:04.4093757Z tt.return 2026-02-21T09:44:04.4093837Z } 2026-02-21T09:44:04.4093914Z } 2026-02-21T09:44:04.4093958Z 2026-02-21T09:44:04.4093989Z {-# 2026-02-21T09:44:04.4094071Z external_resources: { 2026-02-21T09:44:04.4094169Z mlir_reproducer: { 2026-02-21T09:44:04.4095182Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:44:04.4096179Z disable_threading: false, 2026-02-21T09:44:04.4096285Z verify_each: true 2026-02-21T09:44:04.4096373Z } 2026-02-21T09:44:04.4096447Z } 2026-02-21T09:44:04.4096515Z #-} 2026-02-21T09:44:04.4096797Z /tmp/torchinductor_root/ar/cardaviyeotv76douoeiukvte6fo47v6lv74hfry5ejrxtjxtc3t.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:44:04.4097479Z /tmp/torchinductor_root/ar/cardaviyeotv76douoeiukvte6fo47v6lv74hfry5ejrxtjxtc3t.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:44:04.4098027Z [48s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:44:04.4098782Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 512], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:44:04.4099439Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:44:04.4099607Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:44:05.0240671Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:44:05.0248499Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:44:05.0248962Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:44:05.0249396Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:44:05.0249939Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:44:05.0250380Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:44:05.0251032Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:44:05.0251571Z %cst = arith.constant dense<0.000000e+00> : tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0251843Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:44:05.0252004Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:44:05.0252163Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:44:05.0252418Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:44:05.0252581Z %c54_i32 = arith.constant 54 : i32 2026-02-21T09:44:05.0252737Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:44:05.0252884Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:44:05.0253037Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:44:05.0253228Z %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0253424Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:44:05.0253567Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:44:05.0253722Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:44:05.0253882Z %c255_i32 = arith.constant 255 : i32 2026-02-21T09:44:05.0254123Z %cst_1 = arith.constant 
dense<0> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0254419Z %cst_2 = arith.constant dense<0> : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0254615Z %c-1_i32 = arith.constant -1 : i32 2026-02-21T09:44:05.0254815Z %cst_3 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0255095Z %cst_4 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0255379Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.0255604Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.0255827Z %cst_7 = arith.constant dense<8192> : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0256049Z %cst_8 = arith.constant dense<0> : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0256266Z %cst_9 = arith.constant dense<4096> : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0256484Z %cst_10 = arith.constant dense<0> : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0256703Z %cst_11 = arith.constant dense<8192> : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0256989Z %0 = tt.get_program_id x : i32 2026-02-21T09:44:05.0257141Z %1 = arith.muli %0, %c54_i32 : i32 2026-02-21T09:44:05.0257293Z %2 = arith.addi %1, %c54_i32 : i32 2026-02-21T09:44:05.0257449Z %3 = arith.minsi %2, %c32768_i32 : i32 2026-02-21T09:44:05.0257710Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0258072Z %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0258475Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0258870Z %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0259217Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} 
: tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0259564Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0259872Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0260285Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:44:05.0260848Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:44:05.0261374Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.0261711Z %14 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.0271543Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:44:05.0271762Z %16 = arith.cmpi eq, %13, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.0271955Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:44:05.0272193Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:44:05.0272469Z %19 = arith.extsi %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0272799Z %20 = arith.extsi %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0273029Z %21 = arith.subi %3, %1 : i32 2026-02-21T09:44:05.0273144Z %22 = arith.remsi %21, %c4_i32 : i32 2026-02-21T09:44:05.0273269Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:44:05.0273386Z %24 = arith.addi %1, %23 : i32 2026-02-21T09:44:05.0273515Z scf.for %arg3 = %1 to %24 step %c4_i32 : i32 { 
2026-02-21T09:44:05.0273661Z %29 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:44:05.0273785Z %30 = arith.muli %29, %c2_i32 : i32 2026-02-21T09:44:05.0273910Z %31 = arith.subi %c256_i32, %30 : i32 2026-02-21T09:44:05.0274032Z %32 = arith.minsi %31, %c2_i32 : i32 2026-02-21T09:44:05.0274158Z %33 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:44:05.0274278Z %34 = arith.remsi %33, %32 : i32 2026-02-21T09:44:05.0274397Z %35 = arith.addi %30, %34 : i32 2026-02-21T09:44:05.0274512Z %36 = arith.divsi %33, %32 : i32 2026-02-21T09:44:05.0274626Z %37 = arith.muli %35, %c32_i32 : i32 2026-02-21T09:44:05.0274838Z %38 = tt.splat %37 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0275133Z %39 = arith.addi %38, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0275347Z %40 = arith.muli %36, %c32_i32 : i32 2026-02-21T09:44:05.0275538Z %41 = tt.splat %40 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0275759Z %42 = arith.addi %41, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0276042Z %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0276293Z %44 = arith.muli %43, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0276489Z %45 = tt.broadcast %44 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0276841Z %46 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0277232Z %47 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T09:44:05.0277452Z %200 = arith.muli %arg4, %c2_i32 : i32 
2026-02-21T09:44:05.0277631Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0277863Z %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0278146Z %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0278464Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0278659Z %205 = arith.addi %45, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0278857Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0279067Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0279354Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0279754Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0280062Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:44:05.0280240Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0280469Z %212 = arith.addi %211, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0280782Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0281091Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0281326Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0281562Z %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:44:05.0281802Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0282094Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0282427Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0282789Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0283026Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0283259Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0283517Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0283743Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0283967Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0284264Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0284724Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0285075Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:44:05.0285198Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:44:05.0285416Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0285639Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0285913Z %232 = 
tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0286206Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0286398Z %234 = arith.addi %45, %233 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0286599Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0286802Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0287066Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0287484Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0287768Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:44:05.0287979Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0288207Z %241 = arith.addi %240, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0288519Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0288823Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0289055Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0289292Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0289528Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0289815Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0290145Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0290427Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0290664Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0290895Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0291128Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0291371Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0291594Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0291894Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0292351Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0292696Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0292819Z } {tt.num_stages = 1 : i32} 2026-02-21T09:44:05.0292975Z %48 = arith.truncf %47 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:44:05.0293139Z %49 = arith.extsi %40 : i32 to i64 2026-02-21T09:44:05.0293255Z %50 = arith.extsi %37 : i32 to i64 2026-02-21T09:44:05.0293414Z %51 = tt.splat %49 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0293616Z %52 = arith.addi %51, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0293872Z %53 = tt.expand_dims %52 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim 
= 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0294123Z %54 = arith.muli %53, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0294297Z %55 = tt.broadcast %54 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0294497Z %56 = tt.splat %50 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0294699Z %57 = arith.addi %56, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0294949Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0295210Z %59 = tt.broadcast %58 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0295383Z %60 = arith.addi %55, %59 : tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0295580Z %61 = tt.addptr %18, %60 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0295769Z %62 = arith.cmpi sge, %53, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0295931Z %63 = arith.cmpi slt, %53, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0296082Z %64 = arith.andi %62, %63 : tensor<32x1xi1, #mma> 2026-02-21T09:44:05.0296244Z %65 = tt.broadcast %64 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0296426Z %66 = arith.cmpi sge, %58, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0296582Z %67 = arith.cmpi slt, %58, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0296734Z %68 = arith.andi %66, %67 : tensor<1x32xi1, #mma> 2026-02-21T09:44:05.0296894Z %69 = tt.broadcast %68 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0297068Z %70 = arith.andi %65, %69 : tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0297222Z tt.store %61, %48, %70 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:44:05.0297367Z %71 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:44:05.0297489Z %72 = arith.divsi %71, %c256_i32 : i32 2026-02-21T09:44:05.0297606Z %73 = arith.muli %72, %c2_i32 : i32 
2026-02-21T09:44:05.0297724Z %74 = arith.subi %c256_i32, %73 : i32 2026-02-21T09:44:05.0297839Z %75 = arith.minsi %74, %c2_i32 : i32 2026-02-21T09:44:05.0297957Z %76 = arith.remsi %71, %c256_i32 : i32 2026-02-21T09:44:05.0298071Z %77 = arith.remsi %76, %75 : i32 2026-02-21T09:44:05.0298189Z %78 = arith.addi %73, %77 : i32 2026-02-21T09:44:05.0298298Z %79 = arith.divsi %76, %75 : i32 2026-02-21T09:44:05.0298411Z %80 = arith.muli %78, %c32_i32 : i32 2026-02-21T09:44:05.0298617Z %81 = tt.splat %80 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0298936Z %82 = arith.addi %81, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0299145Z %83 = arith.muli %79, %c32_i32 : i32 2026-02-21T09:44:05.0299308Z %84 = tt.splat %83 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0299526Z %85 = arith.addi %84, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0299800Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0300046Z %87 = arith.muli %86, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0300241Z %88 = tt.broadcast %87 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0300589Z %89 = tt.expand_dims %82 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0300975Z %90 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T09:44:05.0301191Z %200 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:44:05.0301382Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0301607Z %202 = 
arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0301880Z %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0302160Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0302356Z %205 = arith.addi %88, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0302568Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0302774Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0303056Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0303462Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0303746Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:44:05.0303922Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0304148Z %212 = arith.addi %211, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0304457Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0304765Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0305003Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0305236Z %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0305476Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0305760Z 
%218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0306093Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0306377Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0306632Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0306870Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0307104Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0307335Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0307557Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0307849Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0308319Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0308667Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:44:05.0308791Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:44:05.0308964Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0309204Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0309482Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0309760Z %233 = tt.broadcast 
%232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0309951Z %234 = arith.addi %88, %233 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0310153Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0310377Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0310645Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0311064Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0311343Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:44:05.0311519Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0311750Z %241 = arith.addi %240, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0312057Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0312365Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0312595Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0312831Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0313068Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0313360Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0313694Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, 
#blocked> 2026-02-21T09:44:05.0313972Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0314208Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0314458Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0314688Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0314917Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0315136Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0315427Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0315884Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0316231Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0316360Z } {tt.num_stages = 1 : i32} 2026-02-21T09:44:05.0316511Z %91 = arith.truncf %90 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:44:05.0316675Z %92 = arith.extsi %83 : i32 to i64 2026-02-21T09:44:05.0316789Z %93 = arith.extsi %80 : i32 to i64 2026-02-21T09:44:05.0316948Z %94 = tt.splat %92 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0317168Z %95 = arith.addi %94, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0317420Z %96 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0317652Z %97 = arith.muli %96, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0317823Z %98 = tt.broadcast 
%97 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0318022Z %99 = tt.splat %93 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0318249Z %100 = arith.addi %99, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0318509Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0318786Z %102 = tt.broadcast %101 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0318963Z %103 = arith.addi %98, %102 : tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0319149Z %104 = tt.addptr %18, %103 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0319342Z %105 = arith.cmpi sge, %96, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0319506Z %106 = arith.cmpi slt, %96, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0319663Z %107 = arith.andi %105, %106 : tensor<32x1xi1, #mma> 2026-02-21T09:44:05.0319829Z %108 = tt.broadcast %107 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0320015Z %109 = arith.cmpi sge, %101, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0320179Z %110 = arith.cmpi slt, %101, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0320336Z %111 = arith.andi %109, %110 : tensor<1x32xi1, #mma> 2026-02-21T09:44:05.0320504Z %112 = tt.broadcast %111 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0320679Z %113 = arith.andi %108, %112 : tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0320836Z tt.store %104, %91, %113 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:44:05.0320980Z %114 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:44:05.0321103Z %115 = arith.divsi %114, %c256_i32 : i32 2026-02-21T09:44:05.0321223Z %116 = arith.muli %115, %c2_i32 : i32 2026-02-21T09:44:05.0321343Z %117 = arith.subi %c256_i32, %116 : i32 2026-02-21T09:44:05.0321462Z %118 = arith.minsi %117, %c2_i32 : i32 
2026-02-21T09:44:05.0321585Z %119 = arith.remsi %114, %c256_i32 : i32 2026-02-21T09:44:05.0321706Z %120 = arith.remsi %119, %118 : i32 2026-02-21T09:44:05.0321838Z %121 = arith.addi %116, %120 : i32 2026-02-21T09:44:05.0321954Z %122 = arith.divsi %119, %118 : i32 2026-02-21T09:44:05.0322070Z %123 = arith.muli %121, %c32_i32 : i32 2026-02-21T09:44:05.0322282Z %124 = tt.splat %123 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0322865Z %125 = arith.addi %124, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0323081Z %126 = arith.muli %122, %c32_i32 : i32 2026-02-21T09:44:05.0323250Z %127 = tt.splat %126 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0323474Z %128 = arith.addi %127, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0323793Z %129 = tt.expand_dims %128 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0332754Z %130 = arith.muli %129, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0332960Z %131 = tt.broadcast %130 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0333320Z %132 = tt.expand_dims %125 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0333765Z %133 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T09:44:05.0333980Z %200 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:44:05.0334155Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0334380Z %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0334677Z 
%203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0334956Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0335172Z %205 = arith.addi %131, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0335371Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0335578Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0335847Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0336250Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0336535Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:44:05.0336717Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0336947Z %212 = arith.addi %211, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0337260Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0337576Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0337806Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0338042Z %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0338282Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0338573Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0338928Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0339213Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0339451Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0339690Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0339920Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0340150Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0340369Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0340667Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0341135Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0341497Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:44:05.0341651Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:44:05.0341822Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0342046Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0342325Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0342612Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0342807Z %234 = 
arith.addi %131, %233 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0343024Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0343227Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0343494Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0343891Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0344171Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:44:05.0344348Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0344575Z %241 = arith.addi %240, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0344885Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0345193Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0345427Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0345661Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0345894Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0346181Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0346531Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0346812Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> 
tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0347051Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0347283Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0347512Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0347738Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0347959Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0348252Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0348713Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0349059Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0349204Z } {tt.num_stages = 1 : i32} 2026-02-21T09:44:05.0349361Z %134 = arith.truncf %133 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:44:05.0349535Z %135 = arith.extsi %126 : i32 to i64 2026-02-21T09:44:05.0349652Z %136 = arith.extsi %123 : i32 to i64 2026-02-21T09:44:05.0349813Z %137 = tt.splat %135 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0350022Z %138 = arith.addi %137, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0350300Z %139 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0350541Z %140 = arith.muli %139, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0350716Z %141 = tt.broadcast %140 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 
2026-02-21T09:44:05.0350937Z %142 = tt.splat %136 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0351142Z %143 = arith.addi %142, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0351403Z %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0351661Z %145 = tt.broadcast %144 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0351839Z %146 = arith.addi %141, %145 : tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0352027Z %147 = tt.addptr %18, %146 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0352221Z %148 = arith.cmpi sge, %139, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0352387Z %149 = arith.cmpi slt, %139, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0352541Z %150 = arith.andi %148, %149 : tensor<32x1xi1, #mma> 2026-02-21T09:44:05.0352714Z %151 = tt.broadcast %150 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0352902Z %152 = arith.cmpi sge, %144, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0353067Z %153 = arith.cmpi slt, %144, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0353225Z %154 = arith.andi %152, %153 : tensor<1x32xi1, #mma> 2026-02-21T09:44:05.0353390Z %155 = tt.broadcast %154 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0353566Z %156 = arith.andi %151, %155 : tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0353720Z tt.store %147, %134, %156 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:44:05.0353868Z %157 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:44:05.0353992Z %158 = arith.divsi %157, %c256_i32 : i32 2026-02-21T09:44:05.0354132Z %159 = arith.muli %158, %c2_i32 : i32 2026-02-21T09:44:05.0354254Z %160 = arith.subi %c256_i32, %159 : i32 2026-02-21T09:44:05.0354373Z %161 = arith.minsi %160, %c2_i32 : i32 2026-02-21T09:44:05.0354493Z %162 = arith.remsi %157, %c256_i32 : i32 
2026-02-21T09:44:05.0354609Z %163 = arith.remsi %162, %161 : i32 2026-02-21T09:44:05.0354724Z %164 = arith.addi %159, %163 : i32 2026-02-21T09:44:05.0354836Z %165 = arith.divsi %162, %161 : i32 2026-02-21T09:44:05.0354951Z %166 = arith.muli %164, %c32_i32 : i32 2026-02-21T09:44:05.0355160Z %167 = tt.splat %166 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0355459Z %168 = arith.addi %167, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0355670Z %169 = arith.muli %165, %c32_i32 : i32 2026-02-21T09:44:05.0355840Z %170 = tt.splat %169 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0356062Z %171 = arith.addi %170, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0356342Z %172 = tt.expand_dims %171 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0356606Z %173 = arith.muli %172, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0356801Z %174 = tt.broadcast %173 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0357150Z %175 = tt.expand_dims %168 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0357538Z %176 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T09:44:05.0357766Z %200 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:44:05.0357938Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0358182Z %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0358455Z %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0358730Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0358924Z %205 = arith.addi %174, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0359120Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0359324Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0359587Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0359988Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0360274Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:44:05.0360453Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0360681Z %212 = arith.addi %211, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0360991Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0361299Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0361531Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0361785Z %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0362020Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0362310Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0362733Z 
%219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0363018Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0363253Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0363489Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0363724Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0363951Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0364177Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0364486Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0364945Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0365291Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:44:05.0365412Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:44:05.0365599Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0365819Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0366095Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0366395Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0366586Z %234 = arith.addi %174, %233 : tensor<32x2xi32, #blocked1> 
2026-02-21T09:44:05.0366786Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0366986Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0367252Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0367648Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0367924Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:44:05.0368103Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0368330Z %241 = arith.addi %240, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0368643Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0368952Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0369181Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0369417Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0369669Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0369956Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0370292Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0370570Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0370810Z 
%250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0371042Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0371272Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0371502Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0371717Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0372012Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0372478Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0372818Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0372943Z } {tt.num_stages = 1 : i32} 2026-02-21T09:44:05.0373096Z %177 = arith.truncf %176 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:44:05.0373266Z %178 = arith.extsi %169 : i32 to i64 2026-02-21T09:44:05.0373393Z %179 = arith.extsi %166 : i32 to i64 2026-02-21T09:44:05.0373557Z %180 = tt.splat %178 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0373768Z %181 = arith.addi %180, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0374043Z %182 = tt.expand_dims %181 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0374277Z %183 = arith.muli %182, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0374451Z %184 = tt.broadcast %183 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0374650Z %185 = tt.splat %179 : i64 -> tensor<32xi64, 
#ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0374852Z %186 = arith.addi %185, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0375106Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0375359Z %188 = tt.broadcast %187 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0375535Z %189 = arith.addi %184, %188 : tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0375716Z %190 = tt.addptr %18, %189 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0375910Z %191 = arith.cmpi sge, %182, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0376069Z %192 = arith.cmpi slt, %182, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0376224Z %193 = arith.andi %191, %192 : tensor<32x1xi1, #mma> 2026-02-21T09:44:05.0376390Z %194 = tt.broadcast %193 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0376571Z %195 = arith.cmpi sge, %187, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0376733Z %196 = arith.cmpi slt, %187, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0376883Z %197 = arith.andi %195, %196 : tensor<1x32xi1, #mma> 2026-02-21T09:44:05.0377066Z %198 = tt.broadcast %197 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0377236Z %199 = arith.andi %194, %198 : tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0377392Z tt.store %190, %177, %199 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:44:05.0377538Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:44:05.0377655Z %25 = arith.subi %3, %24 : i32 2026-02-21T09:44:05.0377765Z %26 = arith.muli %25, %c256_i32 : i32 2026-02-21T09:44:05.0377878Z %27 = arith.subi %24, %c1_i32 : i32 2026-02-21T09:44:05.0378349Z %28:8 = scf.for %arg3 = %c0_i32 to %26 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %27, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = %cst_1) -> (i32, 
i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>) : i32 { 2026-02-21T09:44:05.0378810Z %29 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:44:05.0378934Z %30 = arith.cmpi eq, %arg4, %c255_i32 : i32 2026-02-21T09:44:05.0379065Z %31 = arith.select %30, %c0_i32, %29 : i32 2026-02-21T09:44:05.0379185Z %32 = arith.cmpi eq, %31, %c0_i32 : i32 2026-02-21T09:44:05.0379311Z %33 = arith.select %32, %c0_i32, %arg6 : i32 2026-02-21T09:44:05.0379538Z %34:5 = scf.if %32 -> (i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) { 2026-02-21T09:44:05.0379783Z %95 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:44:05.0379902Z %96 = arith.divsi %95, %c256_i32 : i32 2026-02-21T09:44:05.0380017Z %97 = arith.muli %96, %c2_i32 : i32 2026-02-21T09:44:05.0380131Z %98 = arith.subi %c256_i32, %97 : i32 2026-02-21T09:44:05.0380246Z %99 = arith.minsi %98, %c2_i32 : i32 2026-02-21T09:44:05.0380363Z %100 = arith.remsi %95, %c256_i32 : i32 2026-02-21T09:44:05.0380479Z %101 = arith.remsi %100, %99 : i32 2026-02-21T09:44:05.0380594Z %102 = arith.addi %97, %101 : i32 2026-02-21T09:44:05.0380721Z %103 = arith.divsi %100, %99 : i32 2026-02-21T09:44:05.0380838Z %104 = arith.muli %102, %c32_i32 : i32 2026-02-21T09:44:05.0381062Z %105 = tt.splat %104 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0381357Z %106 = arith.addi %105, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.0381568Z %107 = arith.muli %103, %c32_i32 : i32 2026-02-21T09:44:05.0381736Z %108 = tt.splat %107 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0381959Z %109 = arith.addi %108, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.0382242Z %110 = tt.expand_dims %109 {axis = 1 : i32} : 
tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0382496Z %111 = arith.muli %110, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:44:05.0382691Z %112 = tt.broadcast %111 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0383043Z %113 = tt.expand_dims %106 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0383468Z scf.yield %104, %107, %112, %113, %95 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 2026-02-21T09:44:05.0383700Z } else { 2026-02-21T09:44:05.0383922Z scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 2026-02-21T09:44:05.0384172Z } 2026-02-21T09:44:05.0384251Z %35 = arith.muli %33, %c2_i32 : i32 2026-02-21T09:44:05.0384414Z %36 = tt.splat %35 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0384646Z %37 = arith.addi %36, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0384910Z %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0385177Z %39 = tt.broadcast %38 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0385362Z %40 = arith.addi %34#2, %39 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0385553Z %41 = tt.addptr %9, %40 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0385747Z %42 = tt.load %41 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0386002Z %43 = ttg.convert_layout %42 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0386393Z %44 = arith.extf 
%43 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0386665Z %45 = arith.muli %33, %c8192_i32 : i32 2026-02-21T09:44:05.0386830Z %46 = tt.splat %45 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0387052Z %47 = arith.addi %46, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0387364Z %48 = tt.addptr %10, %47 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0387666Z %49 = tt.load %48 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0387886Z %50 = arith.shli %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0388109Z %51 = arith.shrsi %50, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0388355Z %52 = arith.shrsi %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0388629Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0388980Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0389249Z %55 = tt.broadcast %53 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0389475Z %56 = arith.select %15, %55, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0389702Z %57 = tt.broadcast %54 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0389921Z %58 = arith.select %17, %57, %56 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0390138Z %59 = tt.reshape %58 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, 
#blocked2> 2026-02-21T09:44:05.0390347Z %60 = arith.sitofp %59 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0390628Z %61 = ttg.convert_layout %60 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0391080Z %62 = tt.dot %44, %61, %arg7, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0391412Z %63 = arith.addi %33, %c1_i32 : i32 2026-02-21T09:44:05.0391526Z %64 = arith.muli %63, %c2_i32 : i32 2026-02-21T09:44:05.0391683Z %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0391893Z %66 = arith.addi %65, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.0392156Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:44:05.0392438Z %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0392625Z %69 = arith.addi %34#2, %68 : tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0392814Z %70 = tt.addptr %9, %69 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:44:05.0393007Z %71 = tt.load %70 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:44:05.0393258Z %72 = ttg.convert_layout %71 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0393642Z %73 = arith.extf %72 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0393911Z %74 = arith.muli %63, %c8192_i32 : i32 2026-02-21T09:44:05.0394074Z %75 = tt.splat %74 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:44:05.0394295Z %76 = arith.addi %75, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0394591Z %77 = tt.addptr %10, %76 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0394902Z %78 = tt.load %77 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0395123Z %79 = arith.shli %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0395345Z %80 = arith.shrsi %79, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0395569Z %81 = arith.shrsi %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0395843Z %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0396176Z %83 = tt.expand_dims %81 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:44:05.0396445Z %84 = tt.broadcast %82 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0396686Z %85 = arith.select %15, %84, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0396912Z %86 = tt.broadcast %83 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0397129Z %87 = arith.select %17, %86, %85 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:44:05.0397344Z %88 = tt.reshape %87 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:44:05.0397552Z %89 = arith.sitofp %88 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:44:05.0397830Z %90 = ttg.convert_layout %89 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.0398272Z %91 = tt.dot %73, %90, %62, 
inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0398602Z %92 = arith.addi %33, %c2_i32 : i32 2026-02-21T09:44:05.0398724Z %93 = arith.cmpi eq, %31, %c255_i32 : i32 2026-02-21T09:44:05.0398867Z %94 = arith.select %93, %cst, %91 : tensor<32x32xf32, #mma> 2026-02-21T09:44:05.0398997Z scf.if %93 { 2026-02-21T09:44:05.0399128Z %95 = arith.truncf %91 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:44:05.0399289Z %96 = arith.extsi %34#1 : i32 to i64 2026-02-21T09:44:05.0399406Z %97 = arith.extsi %34#0 : i32 to i64 2026-02-21T09:44:05.0399562Z %98 = tt.splat %96 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0399762Z %99 = arith.addi %98, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.0400038Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0400273Z %101 = arith.muli %100, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0400447Z %102 = tt.broadcast %101 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0400648Z %103 = tt.splat %97 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0400853Z %104 = arith.addi %103, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.0401112Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0401363Z %106 = tt.broadcast %105 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0401541Z %107 = arith.addi %102, %106 : tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0401727Z %108 = tt.addptr %18, %107 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:44:05.0401924Z %109 = arith.cmpi sge, %100, 
%cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0402088Z %110 = arith.cmpi slt, %100, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:44:05.0402243Z %111 = arith.andi %109, %110 : tensor<32x1xi1, #mma> 2026-02-21T09:44:05.0402427Z %112 = tt.broadcast %111 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0402692Z %113 = arith.cmpi sge, %105, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0402854Z %114 = arith.cmpi slt, %105, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:44:05.0403007Z %115 = arith.andi %113, %114 : tensor<1x32xi1, #mma> 2026-02-21T09:44:05.0403171Z %116 = tt.broadcast %115 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0403344Z %117 = arith.andi %112, %116 : tensor<32x32xi1, #mma> 2026-02-21T09:44:05.0403517Z tt.store %108, %95, %117 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:44:05.0403648Z } 2026-02-21T09:44:05.0403904Z scf.yield %31, %34#4, %92, %94, %34#0, %34#1, %34#2, %34#3 : i32, i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.0404199Z } 2026-02-21T09:44:05.0404273Z tt.return 2026-02-21T09:44:05.0404348Z } 2026-02-21T09:44:05.0404420Z } 2026-02-21T09:44:05.0404463Z 2026-02-21T09:44:05.0404492Z {-# 2026-02-21T09:44:05.0404573Z external_resources: { 2026-02-21T09:44:05.0404669Z mlir_reproducer: { 2026-02-21T09:44:05.0405660Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, 
convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:44:05.0406641Z disable_threading: false, 2026-02-21T09:44:05.0406744Z verify_each: true 2026-02-21T09:44:05.0406829Z } 2026-02-21T09:44:05.0406897Z } 2026-02-21T09:44:05.0406962Z #-} 2026-02-21T09:44:05.0407234Z /tmp/torchinductor_root/mo/cmoi5lzvsrwofsyxf4bcpxp5qbx23c7jegvec4ziy5bi6i6bxgsl.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:44:05.0407908Z /tmp/torchinductor_root/mo/cmoi5lzvsrwofsyxf4bcpxp5qbx23c7jegvec4ziy5bi6i6bxgsl.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:44:05.0408468Z [49s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:44:05.0409238Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 32], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[4, 2], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:44:05.0409937Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:44:05.0410099Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:44:05.4711487Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:44:05.4720347Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:44:05.4720668Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:44:05.4720982Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:44:05.4721363Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:44:05.4721615Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:44:05.4721794Z #smem = #ttg.shared_memory 2026-02-21T09:44:05.4722018Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:44:05.4722512Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:44:05.4722998Z %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma> 2026-02-21T09:44:05.4723187Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:44:05.4723301Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:44:05.4723414Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:44:05.4723561Z %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4723714Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:44:05.4723824Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:44:05.4723930Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:44:05.4724037Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:44:05.4724160Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:44:05.4724343Z %cst_1 = arith.constant dense<0> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4724597Z 
%cst_2 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4724850Z %cst_3 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4725097Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4725348Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4725597Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4725819Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:44:05.4726029Z %cst_8 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4726235Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.4726409Z %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.4726613Z %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:44:05.4726754Z %0 = tt.get_program_id x : i32 2026-02-21T09:44:05.4726871Z %1 = arith.divsi %0, %c512_i32 : i32 2026-02-21T09:44:05.4726985Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T09:44:05.4727097Z %3 = arith.subi %c512_i32, %2 : i32 2026-02-21T09:44:05.4727210Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T09:44:05.4727320Z %5 = arith.remsi %0, %c512_i32 : i32 2026-02-21T09:44:05.4727432Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:44:05.4727539Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:44:05.4727645Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:44:05.4727749Z %9 = arith.muli %7, %c16_i32 : i32 2026-02-21T09:44:05.4727984Z %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4728292Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, 
#ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.4728531Z %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.4728734Z %13 = arith.addi %12, %11 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:05.4728909Z %14 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:44:05.4729129Z %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.4729399Z %16 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.4729641Z %17 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.4729849Z %18 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.4730055Z %19 = arith.addi %17, %15 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:05.4730284Z %20 = arith.addi %18, %16 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:05.4730518Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4730835Z %22 = tt.expand_dims %19 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:44:05.4731085Z %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1> 2026-02-21T09:44:05.4731274Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4731486Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:44:05.4731646Z %26 = arith.extsi %9 : i32 to i64 2026-02-21T09:44:05.4731834Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4732146Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim 
= 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4732579Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4732986Z %30 = tt.splat %26 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4733387Z %31 = arith.extsi %10 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4733790Z %32 = arith.addi %30, %31 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4734190Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4734608Z %34 = tt.broadcast %33 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4734920Z %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4735164Z %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4735387Z %37 = arith.andi %35, %36 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4735679Z %38 = tt.broadcast %37 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4736032Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:44:05.4736439Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 
0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:44:05.4736834Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.4737109Z %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.4737301Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 2026-02-21T09:44:05.4737502Z %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:05.4737695Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 2026-02-21T09:44:05.4737955Z %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg4 = %cst) -> (tensor<128x16xf32, #mma>) : i32 { 2026-02-21T09:44:05.4738168Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:44:05.4738355Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4738571Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4738877Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:44:05.4739151Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4739342Z %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4739536Z %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4739743Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:44:05.4739961Z %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:44:05.4740299Z %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:44:05.4740708Z %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4740990Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:44:05.4741197Z %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4741484Z %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4741863Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4742213Z %71 = arith.muli %70, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4742529Z %72 = tt.broadcast %71 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4742828Z %73 = arith.addi %72, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4743131Z %74 = tt.addptr %27, %73 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4743442Z %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4743683Z %76 = arith.cmpi slt, %70, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4743912Z %77 = arith.andi %75, %76 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4744208Z %78 = tt.broadcast %77 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4744510Z %79 = arith.andi %78, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:44:05.4744749Z %80 = tt.load %74, %79, %cst_1 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4744991Z %81 = arith.shli %80, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4745230Z %82 = arith.shrsi %81, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4745460Z %83 = arith.shrsi %80, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4745738Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4746062Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4746350Z %86 = tt.broadcast %84 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4746578Z %87 = arith.select %43, %86, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4746840Z %88 = tt.broadcast %85 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4747066Z %89 = arith.select %45, %88, %87 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4747283Z %90 = tt.reshape %89 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T09:44:05.4747495Z %91 = arith.sitofp %90 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T09:44:05.4747776Z %92 = ttg.convert_layout %91 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4748249Z %93 = tt.dot %66, %92, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma> 2026-02-21T09:44:05.4748598Z %94 = arith.addi %arg3, %c2_i32 : i32 
2026-02-21T09:44:05.4748726Z %95 = arith.muli %94, %c2_i32 : i32 2026-02-21T09:44:05.4748896Z %96 = tt.splat %95 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4749109Z %97 = arith.addi %96, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4749375Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:44:05.4749642Z %99 = tt.broadcast %98 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4749829Z %100 = arith.addi %24, %99 : tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4750035Z %101 = tt.addptr %25, %100 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4750259Z %102 = tt.load %101 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:44:05.4750484Z %103 = ttg.local_alloc %102 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:44:05.4750818Z %104 = ttg.local_load %103 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4751229Z %105 = arith.extf %104 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4751516Z %106 = arith.extsi %94 : i32 to i64 2026-02-21T09:44:05.4751724Z %107 = tt.splat %106 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4752023Z %108 = arith.addi %107, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4752413Z %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4752768Z %110 = arith.muli 
%109, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4753090Z %111 = tt.broadcast %110 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4753392Z %112 = arith.addi %111, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4753696Z %113 = tt.addptr %27, %112 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4754010Z %114 = arith.cmpi sge, %109, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4754265Z %115 = arith.cmpi slt, %109, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4754500Z %116 = arith.andi %114, %115 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4754817Z %117 = tt.broadcast %116 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4755112Z %118 = arith.andi %117, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4755353Z %119 = tt.load %113, %118, %cst_1 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4755595Z %120 = arith.shli %119, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4755830Z %121 = arith.shrsi %120, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4756071Z %122 = arith.shrsi %119, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4756360Z %123 = tt.expand_dims %121 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4756694Z %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4757003Z %125 = tt.broadcast %123 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4757237Z %126 = arith.select %43, %125, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4757477Z %127 = tt.broadcast %124 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4757708Z %128 = arith.select %45, %127, %126 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4757934Z %129 = tt.reshape %128 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T09:44:05.4758159Z %130 = arith.sitofp %129 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T09:44:05.4758465Z %131 = ttg.convert_layout %130 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4758925Z %132 = tt.dot %105, %131, %93, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma> 2026-02-21T09:44:05.4759269Z %133 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:44:05.4759389Z %134 = arith.muli %133, %c2_i32 : i32 2026-02-21T09:44:05.4759561Z %135 = tt.splat %134 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4759783Z %136 = arith.addi %135, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4760067Z %137 = tt.expand_dims %136 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:44:05.4760347Z %138 = tt.broadcast %137 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4760544Z %139 = arith.addi %24, %138 : tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4760750Z %140 = tt.addptr %25, %139 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, 
#blocked1> 2026-02-21T09:44:05.4760971Z %141 = tt.load %140 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:44:05.4761195Z %142 = ttg.local_alloc %141 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:44:05.4761526Z %143 = ttg.local_load %142 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4761932Z %144 = arith.extf %143 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4762230Z %145 = arith.extsi %133 : i32 to i64 2026-02-21T09:44:05.4762436Z %146 = tt.splat %145 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4762786Z %147 = arith.addi %146, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4763177Z %148 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4763526Z %149 = arith.muli %148, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4763833Z %150 = tt.broadcast %149 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4764133Z %151 = arith.addi %150, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4764439Z %152 = tt.addptr %27, %151 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4764755Z %153 = arith.cmpi sge, %148, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4765000Z %154 = arith.cmpi slt, %148, %cst_4 : tensor<2x1xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4765238Z %155 = arith.andi %153, %154 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4765536Z %156 = tt.broadcast %155 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4765827Z %157 = arith.andi %156, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4766069Z %158 = tt.load %152, %157, %cst_1 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4766331Z %159 = arith.shli %158, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4766563Z %160 = arith.shrsi %159, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4766797Z %161 = arith.shrsi %158, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4767079Z %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4767410Z %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4767691Z %164 = tt.broadcast %162 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4767926Z %165 = arith.select %43, %164, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4768162Z %166 = tt.broadcast %163 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4768391Z %167 = arith.select %45, %166, %165 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4768617Z %168 = tt.reshape %167 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T09:44:05.4768844Z %169 = arith.sitofp %168 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 
2026-02-21T09:44:05.4769154Z %170 = ttg.convert_layout %169 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4769612Z %171 = tt.dot %144, %170, %132, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma> 2026-02-21T09:44:05.4769952Z %172 = arith.addi %arg3, %c6_i32 : i32 2026-02-21T09:44:05.4770076Z %173 = arith.muli %172, %c2_i32 : i32 2026-02-21T09:44:05.4770265Z %174 = tt.splat %173 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4770486Z %175 = arith.addi %174, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:05.4770783Z %176 = tt.expand_dims %175 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:44:05.4771055Z %177 = tt.broadcast %176 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4771253Z %178 = arith.addi %24, %177 : tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4771454Z %179 = tt.addptr %25, %178 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:44:05.4771662Z %180 = tt.load %179 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:44:05.4771884Z %181 = ttg.local_alloc %180 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:44:05.4772211Z %182 = ttg.local_load %181 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4772615Z %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4772902Z %184 = arith.extsi %172 : i32 to i64 2026-02-21T09:44:05.4773111Z %185 = tt.splat %184 : i64 -> 
tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4773413Z %186 = arith.addi %185, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:05.4773798Z %187 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4774152Z %188 = arith.muli %187, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4774472Z %189 = tt.broadcast %188 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4774768Z %190 = arith.addi %189, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4775073Z %191 = tt.addptr %27, %190 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4775386Z %192 = arith.cmpi sge, %187, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4775623Z %193 = arith.cmpi slt, %187, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4775856Z %194 = arith.andi %192, %193 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4776149Z %195 = tt.broadcast %194 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4776443Z %196 = arith.andi %195, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4776683Z %197 = tt.load %191, %196, %cst_1 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4776933Z %198 = arith.shli %197, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4777163Z %199 = 
arith.shrsi %198, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4777395Z %200 = arith.shrsi %197, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:05.4777683Z %201 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4778029Z %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:05.4778329Z %203 = tt.broadcast %201 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4778563Z %204 = arith.select %43, %203, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4778812Z %205 = tt.broadcast %202 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4779040Z %206 = arith.select %45, %205, %204 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:05.4779265Z %207 = tt.reshape %206 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T09:44:05.4779478Z %208 = arith.sitofp %207 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T09:44:05.4779767Z %209 = ttg.convert_layout %208 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:05.4780220Z %210 = tt.dot %183, %209, %171, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma> 2026-02-21T09:44:05.4780565Z scf.yield %210 : tensor<128x16xf32, #mma> 2026-02-21T09:44:05.4780713Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:44:05.4780895Z %47 = arith.truncf %46 : tensor<128x16xf32, #mma> to tensor<128x16xbf16, #mma> 2026-02-21T09:44:05.4781156Z %48 = tt.expand_dims %20 {axis = 1 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:44:05.4781386Z %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma> 2026-02-21T09:44:05.4781614Z %50 = tt.expand_dims %13 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma> 2026-02-21T09:44:05.4781865Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x16xi32, #mma> 2026-02-21T09:44:05.4782060Z %52 = tt.broadcast %50 : tensor<1x16xi32, #mma> -> tensor<128x16xi32, #mma> 2026-02-21T09:44:05.4782256Z %53 = arith.addi %51, %52 : tensor<128x16xi32, #mma> 2026-02-21T09:44:05.4782421Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<128x16x!tt.ptr, #mma> 2026-02-21T09:44:05.4782630Z %55 = tt.addptr %54, %53 : tensor<128x16x!tt.ptr, #mma>, tensor<128x16xi32, #mma> 2026-02-21T09:44:05.4782820Z tt.store %55, %47 : tensor<128x16x!tt.ptr, #mma> 2026-02-21T09:44:05.4782946Z tt.return 2026-02-21T09:44:05.4783030Z } 2026-02-21T09:44:05.4783102Z } 2026-02-21T09:44:05.4783145Z 2026-02-21T09:44:05.4783179Z {-# 2026-02-21T09:44:05.4783257Z external_resources: { 2026-02-21T09:44:05.4783355Z mlir_reproducer: { 2026-02-21T09:44:05.4784342Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:44:05.4785334Z disable_threading: false, 2026-02-21T09:44:05.4785440Z verify_each: true 2026-02-21T09:44:05.4785527Z } 2026-02-21T09:44:05.4785599Z } 2026-02-21T09:44:05.4785667Z #-} 
2026-02-21T09:44:05.4785946Z /tmp/torchinductor_root/mr/cmr3ht6tszsgmysysw2xlrimjdi2xbhhqu2jvkhy3a7kq67i2dot.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:44:05.4786636Z /tmp/torchinductor_root/mr/cmr3ht6tszsgmysysw2xlrimjdi2xbhhqu2jvkhy3a7kq67i2dot.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:44:05.4787177Z [49s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:44:05.4787916Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:44:05.4788571Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:44:05.4788735Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:44:14.8072838Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:44:14.8075981Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:44:14.8076935Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:44:14.8077770Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:44:14.8078537Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:44:14.8079382Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:44:14.8080448Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:44:14.8081647Z %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma> 2026-02-21T09:44:14.8081997Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:44:14.8082270Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T09:44:14.8082527Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:44:14.8082858Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:44:14.8083190Z %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:14.8083515Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T09:44:14.8083775Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:44:14.8084023Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:44:14.8084269Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:44:14.8084518Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:44:14.8084759Z %c255_i32 = arith.constant 255 : i32 2026-02-21T09:44:14.8085003Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:44:14.8085404Z %cst_1 = arith.constant dense<0> : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:44:14.8085893Z %cst_2 = arith.constant dense<0> : tensor<128x4xi32, #blocked1> 2026-02-21T09:44:14.8086224Z %c-1_i32 = arith.constant -1 : i32 2026-02-21T09:44:14.8086470Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:44:14.8086781Z %cst_3 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:44:14.8087351Z %cst_4 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8087908Z %cst_5 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8088371Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:14.8088745Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:14.8089119Z %cst_8 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:44:14.8089491Z %cst_9 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:44:14.8089922Z %cst_10 = arith.constant dense<4096> : tensor<128x1xi64, #mma> 2026-02-21T09:44:14.8090276Z %cst_11 = arith.constant dense<0> : tensor<1x16xi64, #mma> 2026-02-21T09:44:14.8090639Z %cst_12 = arith.constant dense<8192> : tensor<1x16xi64, #mma> 2026-02-21T09:44:14.8090881Z %0 = tt.get_program_id x : i32 2026-02-21T09:44:14.8091183Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:14.8091607Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:14.8092074Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:14.8092541Z %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:14.8093001Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = 
#ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:14.8093460Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:14.8093825Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:44:14.8094183Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8094638Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:44:14.8095265Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:44:14.8095874Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:14.8096288Z %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:14.8096586Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 2026-02-21T09:44:14.8096879Z %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:44:14.8097166Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 2026-02-21T09:44:14.8097474Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<128x16x!tt.ptr, #mma> 2026-02-21T09:44:14.8097887Z %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:14.8098392Z %18 = arith.extsi %4 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:14.8098735Z %19 = arith.subi %c16384_i32, %0 : i32 2026-02-21T09:44:14.8098931Z %20 = arith.ceildivsi %19, %c38912_i32 : i32 2026-02-21T09:44:14.8099120Z %21 = arith.muli %20, %c256_i32 : i32 
2026-02-21T09:44:14.8099299Z %22 = arith.subi %0, %c38912_i32 : i32 2026-02-21T09:44:14.8100036Z %23:8 = scf.for %arg3 = %c0_i32 to %21 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %22, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = %cst_1) -> (i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>) : i32 { 2026-02-21T09:44:14.8100778Z %24 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:44:14.8100959Z %25 = arith.cmpi eq, %arg4, %c255_i32 : i32 2026-02-21T09:44:14.8101110Z %26 = arith.select %25, %c0_i32, %24 : i32 2026-02-21T09:44:14.8101258Z %27 = arith.cmpi eq, %26, %c0_i32 : i32 2026-02-21T09:44:14.8101405Z %28 = arith.select %27, %c0_i32, %arg6 : i32 2026-02-21T09:44:14.8101697Z %29:5 = scf.if %27 -> (i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) { 2026-02-21T09:44:14.8101969Z %64 = arith.addi %arg5, %c38912_i32 : i32 2026-02-21T09:44:14.8102138Z %65 = arith.divsi %64, %c2048_i32 : i32 2026-02-21T09:44:14.8102280Z %66 = arith.muli %65, %c4_i32 : i32 2026-02-21T09:44:14.8102415Z %67 = arith.subi %c32_i32, %66 : i32 2026-02-21T09:44:14.8102554Z %68 = arith.minsi %67, %c4_i32 : i32 2026-02-21T09:44:14.8102694Z %69 = arith.remsi %64, %c2048_i32 : i32 2026-02-21T09:44:14.8102832Z %70 = arith.remsi %69, %68 : i32 2026-02-21T09:44:14.8102965Z %71 = arith.addi %66, %70 : i32 2026-02-21T09:44:14.8103097Z %72 = arith.divsi %69, %68 : i32 2026-02-21T09:44:14.8103235Z %73 = arith.muli %71, %c128_i32 : i32 2026-02-21T09:44:14.8103436Z %74 = tt.splat %73 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:14.8103707Z %75 = arith.addi %74, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:14.8103909Z %76 = arith.muli %72, %c16_i32 : i32 2026-02-21T09:44:14.8104153Z %77 = tt.splat %76 : i32 -> tensor<16xi32, #ttg.slice<{dim 
= 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:14.8104505Z %78 = arith.addi %77, %3 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:14.8104881Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:44:14.8105186Z %80 = arith.muli %79, %cst_3 : tensor<128x1xi32, #blocked1> 2026-02-21T09:44:14.8105416Z %81 = tt.broadcast %80 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:44:14.8105831Z %82 = tt.expand_dims %78 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8106362Z %83 = tt.broadcast %82 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8106788Z scf.yield %73, %76, %81, %83, %64 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 2026-02-21T09:44:14.8107062Z } else { 2026-02-21T09:44:14.8107334Z scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 2026-02-21T09:44:14.8107626Z } 2026-02-21T09:44:14.8107832Z %30 = tt.splat %28 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:14.8108173Z %31 = arith.addi %30, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:44:14.8108414Z %32 = arith.muli %28, %c2_i32 : i32 2026-02-21T09:44:14.8108610Z %33 = tt.splat %32 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:14.8108860Z %34 = arith.addi %33, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:44:14.8109177Z 
%35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:44:14.8109517Z %36 = tt.broadcast %35 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:44:14.8109742Z %37 = arith.addi %29#2, %36 : tensor<128x4xi32, #blocked1> 2026-02-21T09:44:14.8109974Z %38 = tt.addptr %7, %37 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:44:14.8110211Z %39 = tt.load %38 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:44:14.8110523Z %40 = ttg.convert_layout %39 : tensor<128x4xbf16, #blocked1> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:14.8110995Z %41 = arith.extf %40 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:14.8111465Z %42 = tt.expand_dims %31 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8111813Z %43 = arith.muli %42, %cst_4 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8112111Z %44 = tt.broadcast %43 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8112407Z %45 = arith.addi %44, %29#3 : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8112702Z %46 = tt.addptr %8, %45 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8113004Z %47 = tt.load %46 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8113227Z %48 = arith.shli %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8113454Z %49 = arith.shrsi %48, %cst_5 : 
tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8113684Z %50 = arith.shrsi %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8113960Z %51 = tt.expand_dims %49 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:14.8114285Z %52 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:44:14.8114563Z %53 = tt.broadcast %51 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:14.8114806Z %54 = arith.select %13, %53, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:14.8115035Z %55 = tt.broadcast %52 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:14.8115260Z %56 = arith.select %15, %55, %54 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:44:14.8115476Z %57 = tt.reshape %56 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T09:44:14.8115691Z %58 = arith.sitofp %57 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T09:44:14.8115980Z %59 = ttg.convert_layout %58 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:14.8116448Z %60 = tt.dot %41, %59, %arg7, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma> 2026-02-21T09:44:14.8116796Z %61 = arith.addi %28, %c2_i32 : i32 2026-02-21T09:44:14.8116921Z %62 = arith.cmpi eq, %26, %c255_i32 : i32 2026-02-21T09:44:14.8117068Z %63 = arith.select %62, %cst, %60 : tensor<128x16xf32, #mma> 2026-02-21T09:44:14.8117227Z scf.if %62 { 2026-02-21T09:44:14.8117363Z %64 = arith.truncf %60 : tensor<128x16xf32, #mma> to tensor<128x16xbf16, #mma> 
2026-02-21T09:44:14.8117547Z %65 = arith.extsi %29#0 : i32 to i64 2026-02-21T09:44:14.8117662Z %66 = arith.extsi %29#1 : i32 to i64 2026-02-21T09:44:14.8117824Z %67 = tt.splat %65 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:14.8118030Z %68 = arith.addi %67, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:14.8118356Z %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:44:14.8118636Z %70 = arith.muli %69, %cst_8 : tensor<128x1xi64, #mma> 2026-02-21T09:44:14.8118856Z %71 = tt.broadcast %70 : tensor<128x1xi64, #mma> -> tensor<128x16xi64, #mma> 2026-02-21T09:44:14.8119183Z %72 = tt.splat %66 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:14.8119448Z %73 = arith.addi %72, %18 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:14.8119730Z %74 = tt.expand_dims %73 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma> 2026-02-21T09:44:14.8120026Z %75 = tt.broadcast %74 : tensor<1x16xi64, #mma> -> tensor<128x16xi64, #mma> 2026-02-21T09:44:14.8120249Z %76 = arith.addi %71, %75 : tensor<128x16xi64, #mma> 2026-02-21T09:44:14.8120629Z %77 = tt.addptr %16, %76 : tensor<128x16x!tt.ptr, #mma>, tensor<128x16xi64, #mma> 2026-02-21T09:44:14.8120855Z %78 = arith.cmpi sge, %69, %cst_9 : tensor<128x1xi64, #mma> 2026-02-21T09:44:14.8121044Z %79 = arith.cmpi slt, %69, %cst_10 : tensor<128x1xi64, #mma> 2026-02-21T09:44:14.8121249Z %80 = arith.andi %78, %79 : tensor<128x1xi1, #mma> 2026-02-21T09:44:14.8121449Z %81 = tt.broadcast %80 : tensor<128x1xi1, #mma> -> tensor<128x16xi1, #mma> 2026-02-21T09:44:14.8121650Z %82 = arith.cmpi sge, %74, %cst_11 : tensor<1x16xi64, #mma> 2026-02-21T09:44:14.8121849Z %83 = arith.cmpi slt, %74, %cst_12 : tensor<1x16xi64, #mma> 2026-02-21T09:44:14.8122014Z %84 = arith.andi %82, %83 : tensor<1x16xi1, #mma> 
2026-02-21T09:44:14.8122219Z %85 = tt.broadcast %84 : tensor<1x16xi1, #mma> -> tensor<128x16xi1, #mma> 2026-02-21T09:44:14.8122424Z %86 = arith.andi %81, %85 : tensor<128x16xi1, #mma> 2026-02-21T09:44:14.8122638Z tt.store %77, %64, %86 : tensor<128x16x!tt.ptr, #mma> 2026-02-21T09:44:14.8128940Z } 2026-02-21T09:44:14.8129225Z scf.yield %26, %29#4, %61, %63, %29#0, %29#1, %29#2, %29#3 : i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:14.8129605Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32} 2026-02-21T09:44:14.8129741Z tt.return 2026-02-21T09:44:14.8129823Z } 2026-02-21T09:44:14.8129901Z } 2026-02-21T09:44:14.8129947Z 2026-02-21T09:44:14.8129977Z {-# 2026-02-21T09:44:14.8130059Z external_resources: { 2026-02-21T09:44:14.8130159Z mlir_reproducer: { 2026-02-21T09:44:14.8131150Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:44:14.8132135Z disable_threading: false, 2026-02-21T09:44:14.8132240Z verify_each: true 2026-02-21T09:44:14.8132331Z } 2026-02-21T09:44:14.8132402Z } 2026-02-21T09:44:14.8132473Z #-} 2026-02-21T09:44:14.8132756Z /tmp/torchinductor_root/ra/crazeaaf2iq4gguqx6yt7kdjcm4mzgqgabqh3cprtf2cw7wot75s.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:44:14.8133458Z 
/tmp/torchinductor_root/ra/crazeaaf2iq4gguqx6yt7kdjcm4mzgqgabqh3cprtf2cw7wot75s.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:44:14.8134006Z [59s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:44:14.8134800Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:44:14.8135531Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:44:14.8135699Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:44:15.8405420Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:44:15.8410252Z #blocked = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:44:15.8411071Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:44:15.8411823Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:44:15.8412532Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:44:15.8413215Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:44:15.8413822Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:44:15.8414263Z #smem = #ttg.shared_memory 2026-02-21T09:44:15.8414816Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:44:15.8415999Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:44:15.8417099Z %cst = arith.constant dense<4096> : tensor<256x1xi64, #mma> 2026-02-21T09:44:15.8417521Z %cst_0 = arith.constant dense<0> : tensor<256x1xi64, #mma> 2026-02-21T09:44:15.8417933Z %cst_1 = arith.constant dense<8192> : tensor<256x1xi64, #mma> 2026-02-21T09:44:15.8418333Z %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #mma> 2026-02-21T09:44:15.8418718Z %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #mma> 2026-02-21T09:44:15.8419123Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8419529Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8419939Z %cst_6 = arith.constant dense<8192> : 
tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8420311Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:44:15.8420676Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:44:15.8421001Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<256x64xf32, #mma> 2026-02-21T09:44:15.8421278Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:44:15.8421491Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:44:15.8421822Z %cst_10 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:15.8422209Z %cst_11 = arith.constant dense<0> : tensor<1x64xi64, #blocked> 2026-02-21T09:44:15.8422564Z %cst_12 = arith.constant dense<8192> : tensor<1x64xi64, #blocked> 2026-02-21T09:44:15.8422881Z %cst_13 = arith.constant dense<1024> : tensor<256x1xi32, #blocked2> 2026-02-21T09:44:15.8423156Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:44:15.8423360Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:44:15.8423568Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:44:15.8423765Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:44:15.8424012Z %cst_14 = arith.constant dense<0> : tensor<2x64xi8, #blocked> 2026-02-21T09:44:15.8424511Z %cst_15 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8424779Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:44:15.8424988Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:44:15.8425362Z %cst_16 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8425697Z %0 = tt.get_program_id x : i32 2026-02-21T09:44:15.8425898Z %1 = arith.divsi %0, %c1024_i32 : i32 2026-02-21T09:44:15.8426105Z %2 = arith.muli %1, %c64_i32 : i32 2026-02-21T09:44:15.8426303Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:44:15.8426505Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T09:44:15.8426703Z %5 = arith.remsi %0, %c1024_i32 : i32 2026-02-21T09:44:15.8426933Z %6 = arith.remsi %5, %4 : i32 
2026-02-21T09:44:15.8427123Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:44:15.8427313Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:44:15.8427494Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:44:15.8427697Z %10 = arith.muli %8, %c256_i32 : i32 2026-02-21T09:44:15.8428059Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:44:15.8428568Z %12 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:15.8429021Z %13 = tt.splat %10 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:44:15.8429419Z %14 = arith.addi %13, %11 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:44:15.8429865Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:44:15.8430415Z %16 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<256x1xi32, #blocked2> 2026-02-21T09:44:15.8430792Z %17 = arith.muli %16, %cst_13 : tensor<256x1xi32, #blocked2> 2026-02-21T09:44:15.8431095Z %18 = tt.broadcast %17 : tensor<256x1xi32, #blocked2> -> tensor<256x4xi32, #blocked2> 2026-02-21T09:44:15.8431404Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<256x4x!tt.ptr, #blocked2> 2026-02-21T09:44:15.8431620Z %20 = arith.extsi %9 : i32 to i64 2026-02-21T09:44:15.8431814Z %21 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked> 2026-02-21T09:44:15.8432118Z %22 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:15.8432531Z %23 = arith.extsi %22 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:15.8432902Z %24 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:44:15.8433224Z %25 = tt.make_range 
{end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:44:15.8433639Z %26 = arith.extsi %25 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:44:15.8434015Z %27 = arith.addi %24, %26 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:44:15.8434367Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked> 2026-02-21T09:44:15.8434740Z %29 = tt.broadcast %28 : tensor<1x64xi64, #blocked> -> tensor<2x64xi64, #blocked> 2026-02-21T09:44:15.8434997Z %30 = arith.cmpi sge, %28, %cst_11 : tensor<1x64xi64, #blocked> 2026-02-21T09:44:15.8435214Z %31 = arith.cmpi slt, %28, %cst_12 : tensor<1x64xi64, #blocked> 2026-02-21T09:44:15.8435427Z %32 = arith.andi %30, %31 : tensor<1x64xi1, #blocked> 2026-02-21T09:44:15.8435660Z %33 = tt.broadcast %32 : tensor<1x64xi1, #blocked> -> tensor<2x64xi1, #blocked> 2026-02-21T09:44:15.8436051Z %34 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:44:15.8436607Z %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:44:15.8437157Z %36 = tt.expand_dims %35 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:44:15.8437492Z %37 = arith.cmpi eq, %36, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:44:15.8437747Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, #blocked1> 2026-02-21T09:44:15.8437997Z %39 = arith.cmpi eq, %36, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:44:15.8438247Z %40 = tt.broadcast %39 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, 
#blocked1> 2026-02-21T09:44:15.8438533Z %41 = ttg.local_alloc : () -> !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> 2026-02-21T09:44:15.8438888Z %42 = tt.expand_dims %15 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:44:15.8439242Z %43 = tt.broadcast %42 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2> 2026-02-21T09:44:15.8439493Z %44 = arith.addi %18, %43 : tensor<256x4xi32, #blocked2> 2026-02-21T09:44:15.8439750Z %45 = tt.addptr %19, %44 : tensor<256x4x!tt.ptr, #blocked2>, tensor<256x4xi32, #blocked2> 2026-02-21T09:44:15.8440018Z %46 = tt.load %45 : tensor<256x4x!tt.ptr, #blocked2> 2026-02-21T09:44:15.8440392Z %47 = ttg.memdesc_index %41[%c0_i32] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:44:15.8440869Z ttg.local_store %46, %47 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:44:15.8441456Z %48:3 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c2_i32 iter_args(%arg4 = %cst_9, %arg5 = %c0_i32, %arg6 = %47) -> (tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>) : i32 { 2026-02-21T09:44:15.8441895Z %103 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:44:15.8442056Z %104 = arith.muli %103, %c2_i32 : i32 2026-02-21T09:44:15.8442267Z %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:44:15.8442512Z %106 = arith.addi %105, %15 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:44:15.8442916Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:44:15.8443210Z %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2> 2026-02-21T09:44:15.8443414Z %109 = arith.addi %18, %108 : tensor<256x4xi32, #blocked2> 
2026-02-21T09:44:15.8443630Z %110 = tt.addptr %19, %109 : tensor<256x4x!tt.ptr, #blocked2>, tensor<256x4xi32, #blocked2> 2026-02-21T09:44:15.8443854Z %111 = tt.load %110 : tensor<256x4x!tt.ptr, #blocked2> 2026-02-21T09:44:15.8444173Z %112 = ttg.local_load %arg6 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:15.8444640Z %113 = arith.extf %112 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:15.8444953Z %114 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:44:15.8445133Z %115 = tt.splat %114 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:15.8445363Z %116 = arith.addi %115, %23 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:15.8445651Z %117 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8445922Z %118 = arith.muli %117, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8446119Z %119 = tt.broadcast %118 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked> 2026-02-21T09:44:15.8446320Z %120 = arith.addi %119, %29 : tensor<2x64xi64, #blocked> 2026-02-21T09:44:15.8448396Z %121 = tt.addptr %21, %120 : tensor<2x64x!tt.ptr, #blocked>, tensor<2x64xi64, #blocked> 2026-02-21T09:44:15.8448610Z %122 = arith.cmpi sge, %117, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8448789Z %123 = arith.cmpi slt, %117, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8448960Z %124 = arith.andi %122, %123 : tensor<2x1xi1, #blocked> 2026-02-21T09:44:15.8449152Z %125 = tt.broadcast %124 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked> 2026-02-21T09:44:15.8449343Z %126 = arith.andi %125, %33 : tensor<2x64xi1, #blocked> 2026-02-21T09:44:15.8449517Z %127 = tt.load %121, %126, %cst_14 : 
tensor<2x64x!tt.ptr, #blocked> 2026-02-21T09:44:15.8449794Z %128 = ttg.convert_layout %127 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8450088Z %129 = arith.shli %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8450335Z %130 = arith.shrsi %129, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8450576Z %131 = arith.shrsi %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8450866Z %132 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:44:15.8451204Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:44:15.8451487Z %134 = tt.broadcast %132 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8451733Z %135 = arith.select %38, %134, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8451995Z %136 = tt.broadcast %133 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8452233Z %137 = arith.select %40, %136, %135 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8452464Z %138 = tt.reshape %137 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:44:15.8452682Z %139 = arith.sitofp %138 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:44:15.8452972Z %140 = ttg.convert_layout %139 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:15.8453439Z %141 = tt.dot %113, %140, %arg4, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<256x64xf32, #mma> 2026-02-21T09:44:15.8453783Z %142 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:44:15.8453914Z %143 = arith.cmpi slt, %142, %c1_i32 : i32 2026-02-21T09:44:15.8454041Z %144 = arith.select %143, %142, %c0_i32 : i32 2026-02-21T09:44:15.8454309Z %145 = ttg.memdesc_index %41[%144] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:44:15.8454681Z ttg.local_store %111, %145 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:44:15.8454988Z scf.yield %141, %144, %145 : tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:44:15.8455237Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:44:15.8455550Z %49 = ttg.local_load %48#2 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:15.8456001Z %50 = arith.extf %49 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:15.8456342Z %51 = arith.addi %23, %cst_10 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:44:15.8456612Z %52 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8456851Z %53 = arith.muli %52, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8457031Z %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked> 2026-02-21T09:44:15.8457215Z %55 = arith.addi %54, %29 : tensor<2x64xi64, #blocked> 2026-02-21T09:44:15.8457401Z %56 = tt.addptr %21, %55 : tensor<2x64x!tt.ptr, #blocked>, tensor<2x64xi64, #blocked> 2026-02-21T09:44:15.8457595Z %57 = arith.cmpi sge, %52, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8457760Z 
%58 = arith.cmpi slt, %52, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:44:15.8457914Z %59 = arith.andi %57, %58 : tensor<2x1xi1, #blocked> 2026-02-21T09:44:15.8458086Z %60 = tt.broadcast %59 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked> 2026-02-21T09:44:15.8458264Z %61 = arith.andi %60, %33 : tensor<2x64xi1, #blocked> 2026-02-21T09:44:15.8458424Z %62 = tt.load %56, %61, %cst_14 : tensor<2x64x!tt.ptr, #blocked> 2026-02-21T09:44:15.8458669Z %63 = ttg.convert_layout %62 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8458940Z %64 = arith.shli %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8459179Z %65 = arith.shrsi %64, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8459410Z %66 = arith.shrsi %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:44:15.8459712Z %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:44:15.8460043Z %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:44:15.8460320Z %69 = tt.broadcast %67 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8460554Z %70 = arith.select %38, %69, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8460787Z %71 = tt.broadcast %68 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8461015Z %72 = arith.select %40, %71, %70 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1> 2026-02-21T09:44:15.8461238Z %73 = tt.reshape %72 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:44:15.8461446Z %74 = arith.sitofp %73 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:44:15.8461736Z 
%75 = ttg.convert_layout %74 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:44:15.8462188Z %76 = tt.dot %50, %75, %48#0, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x64xf32, #mma> 2026-02-21T09:44:15.8462581Z ttg.local_dealloc %41 : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> 2026-02-21T09:44:15.8462790Z %77 = arith.truncf %76 : tensor<256x64xf32, #mma> to tensor<256x64xbf16, #mma> 2026-02-21T09:44:15.8462951Z %78 = arith.extsi %10 : i32 to i64 2026-02-21T09:44:15.8463104Z %79 = tt.splat %arg2 : !tt.ptr -> tensor<256x64x!tt.ptr, #mma> 2026-02-21T09:44:15.8463302Z %80 = tt.splat %78 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:15.8463590Z %81 = arith.extsi %12 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:15.8463865Z %82 = arith.addi %80, %81 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:44:15.8464137Z %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:44:15.8464371Z %84 = arith.muli %83, %cst_1 : tensor<256x1xi64, #mma> 2026-02-21T09:44:15.8464543Z %85 = tt.broadcast %84 : tensor<256x1xi64, #mma> -> tensor<256x64xi64, #mma> 2026-02-21T09:44:15.8464742Z %86 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:15.8464977Z %87 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:15.8465272Z %88 = arith.extsi %87 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:44:15.8465539Z %89 = arith.addi %86, %88 : tensor<64xi64, #ttg.slice<{dim = 0, parent = 
#mma}>> 2026-02-21T09:44:15.8465788Z %90 = tt.expand_dims %89 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi64, #mma> 2026-02-21T09:44:15.8466038Z %91 = tt.broadcast %90 : tensor<1x64xi64, #mma> -> tensor<256x64xi64, #mma> 2026-02-21T09:44:15.8466209Z %92 = arith.addi %85, %91 : tensor<256x64xi64, #mma> 2026-02-21T09:44:15.8466388Z %93 = tt.addptr %79, %92 : tensor<256x64x!tt.ptr, #mma>, tensor<256x64xi64, #mma> 2026-02-21T09:44:15.8466581Z %94 = arith.cmpi sge, %83, %cst_0 : tensor<256x1xi64, #mma> 2026-02-21T09:44:15.8466739Z %95 = arith.cmpi slt, %83, %cst : tensor<256x1xi64, #mma> 2026-02-21T09:44:15.8466889Z %96 = arith.andi %94, %95 : tensor<256x1xi1, #mma> 2026-02-21T09:44:15.8467051Z %97 = tt.broadcast %96 : tensor<256x1xi1, #mma> -> tensor<256x64xi1, #mma> 2026-02-21T09:44:15.8467229Z %98 = arith.cmpi sge, %90, %cst_3 : tensor<1x64xi64, #mma> 2026-02-21T09:44:15.8467386Z %99 = arith.cmpi slt, %90, %cst_2 : tensor<1x64xi64, #mma> 2026-02-21T09:44:15.8467550Z %100 = arith.andi %98, %99 : tensor<1x64xi1, #mma> 2026-02-21T09:44:15.8467717Z %101 = tt.broadcast %100 : tensor<1x64xi1, #mma> -> tensor<256x64xi1, #mma> 2026-02-21T09:44:15.8467892Z %102 = arith.andi %97, %101 : tensor<256x64xi1, #mma> 2026-02-21T09:44:15.8468048Z tt.store %93, %77, %102 : tensor<256x64x!tt.ptr, #mma> 2026-02-21T09:44:15.8468181Z tt.return 2026-02-21T09:44:15.8468261Z } 2026-02-21T09:44:15.8468333Z } 2026-02-21T09:44:15.8468377Z 2026-02-21T09:44:15.8468408Z {-# 2026-02-21T09:44:15.8468488Z external_resources: { 2026-02-21T09:44:15.8468585Z mlir_reproducer: { 2026-02-21T09:44:15.8469580Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, 
convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:44:15.8470583Z disable_threading: false, 2026-02-21T09:44:15.8470687Z verify_each: true 2026-02-21T09:44:15.8470778Z } 2026-02-21T09:44:15.8470848Z } 2026-02-21T09:44:15.8470918Z #-} 2026-02-21T09:44:15.8471201Z /tmp/torchinductor_root/a4/ca4zjozlac45fwgzml4rubpym6qwmt76ix5wkhh3pvepmzy2uwlu.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:44:15.8471883Z /tmp/torchinductor_root/a4/ca4zjozlac45fwgzml4rubpym6qwmt76ix5wkhh3pvepmzy2uwlu.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:44:15.8472444Z [60s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:44:15.8473162Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 256, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:44:15.8473826Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:44:15.8473993Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:44:25.7480869Z Initial population exploring neighbors 44% ━━━━━━ 44/100 1.7 configs/s 2026-02-21T09:44:25.7485055Z WARNING:tritonbench.utils.triton_op:Completed input ID 17: 2026-02-21T09:44:25.7485499Z x_val 2026-02-21T09:44:25.7485775Z --------------------- 2026-02-21T09:44:25.7486061Z (1, 4096, 8192, 1024) 2026-02-21T09:44:25.7486222Z 2026-02-21T09:44:25.7503274Z 60%|██████ | 6/10 [41:46<23:33, 353.43s/it]WARNING:tritonbench.utils.triton_op:Running input ID 21: 2026-02-21T09:44:25.7503680Z x_val 2026-02-21T09:44:25.7503846Z --------------------- 2026-02-21T09:44:25.7504059Z (4, 4096, 8192, 1024) 2026-02-21T09:44:25.7507186Z INFO:tritonbench.utils.triton_op:Took 0.25ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:44:26.7812439Z INFO:tritonbench.utils.triton_op:Took 4.46ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:44:28.9459036Z INFO:tritonbench.utils.triton_op:Took 0.17ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T09:44:28.9468760Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:44:28.9469106Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:44:28.9469302Z 'dtype': 'torch.bfloat16', 2026-02-21T09:44:28.9469516Z 'shape': (4, 4096, 1024), 2026-02-21T09:44:28.9470152Z 'stride': (4194304, 1024, 1)}, 2026-02-21T09:44:28.9470319Z { 'device': 'cuda:0', 2026-02-21T09:44:28.9470487Z 'dtype': 'torch.int32', 2026-02-21T09:44:28.9470642Z 'shape': (1024, 8192), 2026-02-21T09:44:28.9470801Z 'stride': (8192, 1)}), 2026-02-21T09:44:28.9470965Z 'kwargs': {}} 2026-02-21T09:44:28.9487560Z INFO:tritonbench.utils.triton_op:Took 2.02ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T09:44:29.1300996Z [0s] Autotune random seed: 2138032649 2026-02-21T09:44:29.1598284Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:45:02.9298681Z [33s] Timeout after 30s compiling 
Config(block_sizes=[64, 256, 2], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T09:45:11.6856896Z [42s] Timeout after 30s compiling Config(block_sizes=[32, 2048, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:45:15.6222341Z [46s] Timeout after 30s compiling Config(block_sizes=[16, 4096, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T09:45:15.6246816Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T09:45:36.3470003Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:45:36.3473659Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:45:36.3474152Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:45:36.3474625Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:45:36.3475084Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:45:36.3475504Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:45:36.3477277Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:45:36.3477560Z #smem = #ttg.shared_memory 2026-02-21T09:45:36.3477945Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:45:36.3478634Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:45:36.3479327Z %cst = arith.constant dense<8192> : tensor<1x512xi64, #mma> 2026-02-21T09:45:36.3479592Z %cst_0 = arith.constant dense<0> : tensor<1x512xi64, #mma> 2026-02-21T09:45:36.3479843Z %cst_1 = arith.constant dense<16384> : tensor<32x1xi64, #mma> 2026-02-21T09:45:36.3480411Z %cst_2 = arith.constant dense<0> : tensor<32x1xi64, #mma> 2026-02-21T09:45:36.3480661Z %cst_3 = arith.constant dense<8192> : tensor<32x1xi64, #mma> 2026-02-21T09:45:36.3480935Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:36.3481215Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:36.3481608Z %cst_6 = arith.constant dense<0.000000e+00> : 
tensor<32x512xf32, #mma> 2026-02-21T09:45:36.3481917Z %cst_7 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:36.3482118Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:45:36.3482263Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:45:36.3482419Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:45:36.3482670Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:45:36.3482818Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:45:36.3482962Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:45:36.3483099Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:45:36.3483250Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:45:36.3483440Z %cst_8 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3483635Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:45:36.3483782Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:45:36.3483989Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:45:36.3484211Z %cst_9 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3484456Z %0 = tt.get_program_id x : i32 2026-02-21T09:45:36.3484601Z %1 = arith.divsi %0, %c4096_i32 : i32 2026-02-21T09:45:36.3484748Z %2 = arith.muli %1, %c8_i32 : i32 2026-02-21T09:45:36.3484889Z %3 = arith.subi %c16_i32, %2 : i32 2026-02-21T09:45:36.3485035Z %4 = arith.minsi %3, %c8_i32 : i32 2026-02-21T09:45:36.3485177Z %5 = arith.remsi %0, %c4096_i32 : i32 2026-02-21T09:45:36.3485326Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:45:36.3485539Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:45:36.3485678Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:45:36.3485814Z %9 = arith.muli %7, %c512_i32 : i32 2026-02-21T09:45:36.3486155Z %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:45:36.3486514Z %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:36.3486826Z %12 = tt.splat %9 : 
i32 -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:45:36.3487113Z %13 = arith.addi %12, %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:45:36.3487332Z %14 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:45:36.3487578Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:36.3487912Z %16 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:36.3488209Z %17 = tt.splat %14 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:36.3488484Z %18 = arith.addi %17, %15 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:36.3488786Z %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3489171Z %20 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:45:36.3489481Z %21 = arith.muli %20, %cst_7 : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:36.3489717Z %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3489986Z %23 = tt.splat %arg0 : !tt.ptr -> tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:36.3490318Z %24 = tt.expand_dims %13 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3490676Z %25 = tt.splat %arg1 : !tt.ptr -> tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:45:36.3491019Z %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:45:36.3491451Z %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = 
#blocked}>> 2026-02-21T09:45:36.3491857Z %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:36.3492112Z %29 = arith.cmpi eq, %28, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:36.3492307Z %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T09:45:36.3492530Z %31 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:36.3492728Z %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked> 2026-02-21T09:45:36.3492996Z %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_6) -> (tensor<32x512xf32, #mma>) : i32 { 2026-02-21T09:45:36.3493216Z %60 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:45:36.3493392Z %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3493610Z %62 = arith.addi %61, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3493882Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:36.3494157Z %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3494346Z %65 = arith.addi %22, %64 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3494567Z %66 = tt.addptr %23, %65 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3494771Z %67 = tt.load %66 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:36.3495059Z %68 = ttg.convert_layout %67 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3495459Z %69 = arith.extf %68 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3495744Z %70 = arith.muli %arg3, 
%c8192_i32 : i32 2026-02-21T09:45:36.3495892Z %71 = tt.splat %70 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3496049Z %72 = arith.addi %71, %24 : tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3496247Z %73 = tt.addptr %25, %72 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3496446Z %74 = tt.load %73 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:45:36.3496692Z %75 = ttg.convert_layout %74 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3496975Z %76 = arith.shli %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3497208Z %77 = arith.shrsi %76, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3497448Z %78 = arith.shrsi %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3497737Z %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3498077Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3498363Z %81 = tt.broadcast %79 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3498626Z %82 = arith.select %30, %81, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3498867Z %83 = tt.broadcast %80 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3499095Z %84 = arith.select %32, %83, %82 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3499339Z %85 = tt.reshape %84 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:45:36.3499562Z %86 = arith.sitofp %85 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3> 2026-02-21T09:45:36.3499808Z %87 = ttg.local_alloc %86 : 
(tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:45:36.3500135Z %88 = ttg.local_load %87 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3500606Z %89 = tt.dot %69, %88, %arg4, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:45:36.3500953Z %90 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:45:36.3501083Z %91 = arith.muli %90, %c2_i32 : i32 2026-02-21T09:45:36.3501249Z %92 = tt.splat %91 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3501471Z %93 = arith.addi %92, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3501742Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:36.3502016Z %95 = tt.broadcast %94 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3502212Z %96 = arith.addi %22, %95 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3502407Z %97 = tt.addptr %23, %96 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3502629Z %98 = tt.load %97 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:36.3502884Z %99 = ttg.convert_layout %98 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3503299Z %100 = arith.extf %99 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3503581Z %101 = arith.muli %90, %c8192_i32 : i32 2026-02-21T09:45:36.3503720Z %102 = tt.splat %101 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3503879Z %103 = arith.addi %102, %24 : 
tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3504075Z %104 = tt.addptr %25, %103 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3504277Z %105 = tt.load %104 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:45:36.3504526Z %106 = ttg.convert_layout %105 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3504806Z %107 = arith.shli %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3505043Z %108 = arith.shrsi %107, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3505275Z %109 = arith.shrsi %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3505568Z %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3505905Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3506189Z %112 = tt.broadcast %110 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3506432Z %113 = arith.select %30, %112, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3506690Z %114 = tt.broadcast %111 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3506925Z %115 = arith.select %32, %114, %113 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3507158Z %116 = tt.reshape %115 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:45:36.3507393Z %117 = arith.sitofp %116 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3> 2026-02-21T09:45:36.3507649Z %118 = ttg.local_alloc %117 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:45:36.3507972Z %119 = ttg.local_load %118 : 
!ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3508440Z %120 = tt.dot %100, %119, %89, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:45:36.3508781Z %121 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:45:36.3508903Z %122 = arith.muli %121, %c2_i32 : i32 2026-02-21T09:45:36.3509071Z %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3509295Z %124 = arith.addi %123, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3509570Z %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:36.3509845Z %126 = tt.broadcast %125 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3510039Z %127 = arith.addi %22, %126 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3510238Z %128 = tt.addptr %23, %127 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3510478Z %129 = tt.load %128 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:36.3510741Z %130 = ttg.convert_layout %129 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3511155Z %131 = arith.extf %130 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3511433Z %132 = arith.muli %121, %c8192_i32 : i32 2026-02-21T09:45:36.3511576Z %133 = tt.splat %132 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3511731Z %134 = arith.addi %133, %24 : tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3511930Z %135 = tt.addptr %25, %134 : tensor<1x512x!tt.ptr, 
#blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3512133Z %136 = tt.load %135 : tensor<1x512x!tt.ptr, #blocked2> 2026-02-21T09:45:36.3512374Z %137 = ttg.convert_layout %136 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3512655Z %138 = arith.shli %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3512888Z %139 = arith.shrsi %138, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3513123Z %140 = arith.shrsi %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3513410Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3513746Z %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3514032Z %143 = tt.broadcast %141 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3514270Z %144 = arith.select %30, %143, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3514528Z %145 = tt.broadcast %142 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3514761Z %146 = arith.select %32, %145, %144 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3514991Z %147 = tt.reshape %146 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:45:36.3515233Z %148 = arith.sitofp %147 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3> 2026-02-21T09:45:36.3515484Z %149 = ttg.local_alloc %148 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:45:36.3515807Z %150 = ttg.local_load %149 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:45:36.3516276Z %151 = tt.dot %131, %150, %120, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:45:36.3516614Z %152 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:45:36.3516737Z %153 = arith.muli %152, %c2_i32 : i32 2026-02-21T09:45:36.3516907Z %154 = tt.splat %153 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3517133Z %155 = arith.addi %154, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:36.3517408Z %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:36.3517681Z %157 = tt.broadcast %156 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3517873Z %158 = arith.addi %22, %157 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3518077Z %159 = tt.addptr %23, %158 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:36.3518297Z %160 = tt.load %159 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:36.3518561Z %161 = ttg.convert_layout %160 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3518968Z %162 = arith.extf %161 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3519253Z %163 = arith.muli %152, %c8192_i32 : i32 2026-02-21T09:45:36.3519396Z %164 = tt.splat %163 : i32 -> tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3519550Z %165 = arith.addi %164, %24 : tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3519746Z %166 = tt.addptr %25, %165 : tensor<1x512x!tt.ptr, #blocked2>, tensor<1x512xi32, #blocked2> 2026-02-21T09:45:36.3519946Z %167 = tt.load %166 : tensor<1x512x!tt.ptr, 
#blocked2> 2026-02-21T09:45:36.3520188Z %168 = ttg.convert_layout %167 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3520470Z %169 = arith.shli %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3520707Z %170 = arith.shrsi %169, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3520941Z %171 = arith.shrsi %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:36.3521229Z %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3521567Z %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked> 2026-02-21T09:45:36.3521852Z %174 = tt.broadcast %172 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3522091Z %175 = arith.select %30, %174, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3522351Z %176 = tt.broadcast %173 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3522642Z %177 = arith.select %32, %176, %175 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked> 2026-02-21T09:45:36.3522877Z %178 = tt.reshape %177 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3> 2026-02-21T09:45:36.3523130Z %179 = arith.sitofp %178 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3> 2026-02-21T09:45:36.3523379Z %180 = ttg.local_alloc %179 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem> 2026-02-21T09:45:36.3523703Z %181 = ttg.local_load %180 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:36.3524171Z %182 = tt.dot %162, %181, %151, inputPrecision = tf32 : tensor<32x2xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma> 2026-02-21T09:45:36.3524516Z scf.yield %182 : tensor<32x512xf32, #mma> 2026-02-21T09:45:36.3524643Z } {tt.num_stages = 1 : i32} 2026-02-21T09:45:36.3524796Z %34 = arith.truncf %33 : tensor<32x512xf32, #mma> to tensor<32x512xbf16, #mma> 2026-02-21T09:45:36.3524963Z %35 = arith.extsi %14 : i32 to i64 2026-02-21T09:45:36.3525077Z %36 = arith.extsi %9 : i32 to i64 2026-02-21T09:45:36.3525231Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<32x512x!tt.ptr, #mma> 2026-02-21T09:45:36.3525428Z %38 = tt.splat %35 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:36.3525698Z %39 = arith.extsi %16 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:36.3525965Z %40 = arith.addi %38, %39 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:36.3526234Z %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:45:36.3526466Z %42 = arith.muli %41, %cst_3 : tensor<32x1xi64, #mma> 2026-02-21T09:45:36.3526653Z %43 = tt.broadcast %42 : tensor<32x1xi64, #mma> -> tensor<32x512xi64, #mma> 2026-02-21T09:45:36.3526855Z %44 = tt.splat %36 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:36.3527128Z %45 = arith.extsi %11 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:36.3527396Z %46 = arith.addi %44, %45 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:36.3527655Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:45:36.3527906Z %48 = tt.broadcast %47 : tensor<1x512xi64, #mma> -> tensor<32x512xi64, #mma> 
2026-02-21T09:45:36.3528080Z %49 = arith.addi %43, %48 : tensor<32x512xi64, #mma> 2026-02-21T09:45:36.3528264Z %50 = tt.addptr %37, %49 : tensor<32x512x!tt.ptr, #mma>, tensor<32x512xi64, #mma> 2026-02-21T09:45:36.3528454Z %51 = arith.cmpi sge, %41, %cst_2 : tensor<32x1xi64, #mma> 2026-02-21T09:45:36.3528615Z %52 = arith.cmpi slt, %41, %cst_1 : tensor<32x1xi64, #mma> 2026-02-21T09:45:36.3528765Z %53 = arith.andi %51, %52 : tensor<32x1xi1, #mma> 2026-02-21T09:45:36.3528930Z %54 = tt.broadcast %53 : tensor<32x1xi1, #mma> -> tensor<32x512xi1, #mma> 2026-02-21T09:45:36.3529105Z %55 = arith.cmpi sge, %47, %cst_0 : tensor<1x512xi64, #mma> 2026-02-21T09:45:36.3529266Z %56 = arith.cmpi slt, %47, %cst : tensor<1x512xi64, #mma> 2026-02-21T09:45:36.3529417Z %57 = arith.andi %55, %56 : tensor<1x512xi1, #mma> 2026-02-21T09:45:36.3529581Z %58 = tt.broadcast %57 : tensor<1x512xi1, #mma> -> tensor<32x512xi1, #mma> 2026-02-21T09:45:36.3529751Z %59 = arith.andi %54, %58 : tensor<32x512xi1, #mma> 2026-02-21T09:45:36.3529903Z tt.store %50, %34, %59 : tensor<32x512x!tt.ptr, #mma> 2026-02-21T09:45:36.3530060Z tt.return 2026-02-21T09:45:36.3530141Z } 2026-02-21T09:45:36.3530223Z } 2026-02-21T09:45:36.3530266Z 2026-02-21T09:45:36.3530302Z {-# 2026-02-21T09:45:36.3530382Z external_resources: { 2026-02-21T09:45:36.3530481Z mlir_reproducer: { 2026-02-21T09:45:36.3531478Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 
2026-02-21T09:45:36.3532486Z disable_threading: false, 2026-02-21T09:45:36.3532593Z verify_each: true 2026-02-21T09:45:36.3532683Z } 2026-02-21T09:45:36.3532756Z } 2026-02-21T09:45:36.3532824Z #-} 2026-02-21T09:45:36.3533102Z /tmp/torchinductor_root/ly/cly5ppem7bczabonqsmt6tvx5rfi6ahp2vmwbcypndonimgii6tk.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:45:36.3533785Z /tmp/torchinductor_root/ly/cly5ppem7bczabonqsmt6tvx5rfi6ahp2vmwbcypndonimgii6tk.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:45:36.3534335Z [67s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:45:36.3535076Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 512], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:45:36.3535743Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:45:36.3535908Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:45:38.0702895Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:45:38.0711427Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:45:38.0712168Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:45:38.0712799Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:45:38.0713363Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:45:38.0714004Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:45:38.0714951Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:45:38.0715729Z %cst = arith.constant dense<0.000000e+00> : tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0716044Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:45:38.0716279Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:45:38.0716528Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:45:38.0716779Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T09:45:38.0717028Z %c216_i32 = arith.constant 216 : i32 2026-02-21T09:45:38.0717428Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:45:38.0717655Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:45:38.0717880Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:45:38.0718122Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:45:38.0718473Z %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0718773Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:45:38.0718991Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:45:38.0719225Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:45:38.0719462Z %c255_i32 = 
arith.constant 255 : i32 2026-02-21T09:45:38.0719831Z %cst_1 = arith.constant dense<0> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0720271Z %cst_2 = arith.constant dense<0> : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0720570Z %c-1_i32 = arith.constant -1 : i32 2026-02-21T09:45:38.0720859Z %cst_3 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0721193Z %cst_4 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0721534Z %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:38.0721803Z %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:38.0722088Z %cst_7 = arith.constant dense<8192> : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0722360Z %cst_8 = arith.constant dense<0> : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0722713Z %cst_9 = arith.constant dense<16384> : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0722990Z %cst_10 = arith.constant dense<0> : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0723260Z %cst_11 = arith.constant dense<8192> : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0723496Z %0 = tt.get_program_id x : i32 2026-02-21T09:45:38.0723688Z %1 = arith.muli %0, %c216_i32 : i32 2026-02-21T09:45:38.0723876Z %2 = arith.addi %1, %c216_i32 : i32 2026-02-21T09:45:38.0729219Z %3 = arith.minsi %2, %c131072_i32 : i32 2026-02-21T09:45:38.0729566Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0730070Z %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0730561Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0731051Z %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T09:45:38.0731467Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0731817Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0732107Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0732486Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:45:38.0732993Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:45:38.0733483Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:38.0733794Z %14 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:38.0734033Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:45:38.0734360Z %16 = arith.cmpi eq, %13, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:38.0734596Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:45:38.0734878Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:45:38.0735207Z %19 = arith.extsi %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0735616Z %20 = arith.extsi %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0735920Z %21 = arith.subi %3, %1 : i32 2026-02-21T09:45:38.0736053Z %22 = arith.remsi %21, %c4_i32 : i32 2026-02-21T09:45:38.0736201Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:45:38.0736338Z %24 = arith.addi %1, %23 : i32 
2026-02-21T09:45:38.0736489Z scf.for %arg3 = %1 to %24 step %c4_i32 : i32 { 2026-02-21T09:45:38.0736655Z %29 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:45:38.0736808Z %30 = arith.muli %29, %c2_i32 : i32 2026-02-21T09:45:38.0736959Z %31 = arith.subi %c256_i32, %30 : i32 2026-02-21T09:45:38.0737106Z %32 = arith.minsi %31, %c2_i32 : i32 2026-02-21T09:45:38.0737263Z %33 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T09:45:38.0737409Z %34 = arith.remsi %33, %32 : i32 2026-02-21T09:45:38.0737552Z %35 = arith.addi %30, %34 : i32 2026-02-21T09:45:38.0737684Z %36 = arith.divsi %33, %32 : i32 2026-02-21T09:45:38.0737831Z %37 = arith.muli %35, %c32_i32 : i32 2026-02-21T09:45:38.0738095Z %38 = tt.splat %37 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0738456Z %39 = arith.addi %38, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0738712Z %40 = arith.muli %36, %c32_i32 : i32 2026-02-21T09:45:38.0738918Z %41 = tt.splat %40 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0739215Z %42 = arith.addi %41, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0739560Z %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0739884Z %44 = arith.muli %43, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0740124Z %45 = tt.broadcast %44 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0740545Z %46 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0740975Z %47 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 
2026-02-21T09:45:38.0741195Z %200 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:45:38.0741373Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0741606Z %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0741884Z %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0742167Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0742370Z %205 = arith.addi %45, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0742573Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0742786Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0743055Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0743467Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0743782Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:45:38.0743966Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0744203Z %212 = arith.addi %211, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0744531Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0744848Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0745088Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0745329Z %216 = arith.shrsi 
%215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0745576Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0745868Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0746211Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0746503Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0746743Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0746985Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0747220Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0747473Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0747705Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0748002Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0748496Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0748855Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:45:38.0748984Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:45:38.0749164Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0749391Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T09:45:38.0749677Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0749963Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0750161Z %234 = arith.addi %45, %233 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0750366Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0750574Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0750847Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0751252Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0751540Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:45:38.0751744Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0751974Z %241 = arith.addi %240, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0752299Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0752628Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0752862Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0753103Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0753341Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0753642Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0753982Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0754267Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0754512Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0754750Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0754988Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0755225Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0755453Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0755783Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0756261Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0756613Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0756745Z } {tt.num_stages = 1 : i32} 2026-02-21T09:45:38.0756904Z %48 = arith.truncf %47 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:45:38.0757075Z %49 = arith.extsi %40 : i32 to i64 2026-02-21T09:45:38.0757195Z %50 = arith.extsi %37 : i32 to i64 2026-02-21T09:45:38.0757360Z %51 = tt.splat %49 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0757570Z %52 = arith.addi %51, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0757828Z %53 = tt.expand_dims %52 
{axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0758069Z %54 = arith.muli %53, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0758243Z %55 = tt.broadcast %54 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0758447Z %56 = tt.splat %50 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0758648Z %57 = arith.addi %56, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0758914Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0759170Z %59 = tt.broadcast %58 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0759345Z %60 = arith.addi %55, %59 : tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0759550Z %61 = tt.addptr %18, %60 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0759747Z %62 = arith.cmpi sge, %53, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0759916Z %63 = arith.cmpi slt, %53, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0760075Z %64 = arith.andi %62, %63 : tensor<32x1xi1, #mma> 2026-02-21T09:45:38.0760255Z %65 = tt.broadcast %64 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0760440Z %66 = arith.cmpi sge, %58, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0760602Z %67 = arith.cmpi slt, %58, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0760759Z %68 = arith.andi %66, %67 : tensor<1x32xi1, #mma> 2026-02-21T09:45:38.0760925Z %69 = tt.broadcast %68 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0761102Z %70 = arith.andi %65, %69 : tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0761258Z tt.store %61, %48, %70 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:45:38.0761408Z %71 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:45:38.0761536Z %72 = arith.divsi %71, %c1024_i32 : i32 2026-02-21T09:45:38.0761660Z %73 = 
arith.muli %72, %c2_i32 : i32 2026-02-21T09:45:38.0761784Z %74 = arith.subi %c256_i32, %73 : i32 2026-02-21T09:45:38.0761905Z %75 = arith.minsi %74, %c2_i32 : i32 2026-02-21T09:45:38.0762033Z %76 = arith.remsi %71, %c1024_i32 : i32 2026-02-21T09:45:38.0762155Z %77 = arith.remsi %76, %75 : i32 2026-02-21T09:45:38.0762274Z %78 = arith.addi %73, %77 : i32 2026-02-21T09:45:38.0762392Z %79 = arith.divsi %76, %75 : i32 2026-02-21T09:45:38.0762508Z %80 = arith.muli %78, %c32_i32 : i32 2026-02-21T09:45:38.0762770Z %81 = tt.splat %80 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0763068Z %82 = arith.addi %81, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0763303Z %83 = arith.muli %79, %c32_i32 : i32 2026-02-21T09:45:38.0763479Z %84 = tt.splat %83 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0763716Z %85 = arith.addi %84, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0763999Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0764249Z %87 = arith.muli %86, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0764441Z %88 = tt.broadcast %87 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0764793Z %89 = tt.expand_dims %82 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0765184Z %90 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T09:45:38.0765406Z %200 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:45:38.0765583Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:45:38.0765815Z %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0766097Z %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0766378Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0766578Z %205 = arith.addi %88, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0766783Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0766997Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0767297Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0767700Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0768007Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:45:38.0768189Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0768422Z %212 = arith.addi %211, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0768736Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0769045Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0769288Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0769525Z %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0769769Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:45:38.0770075Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0770414Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0770706Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0770949Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0771207Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0771448Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0771698Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0771927Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0772227Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0772698Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0773049Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:45:38.0773177Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:45:38.0773358Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0773584Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0773873Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 
2026-02-21T09:45:38.0774161Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0774357Z %234 = arith.addi %88, %233 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0774565Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0782935Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0783225Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0783696Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0783987Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:45:38.0784170Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0784429Z %241 = arith.addi %240, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0784743Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0785059Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0785298Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0785542Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0785788Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0786084Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0786426Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim 
= 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0786714Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0786955Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0787198Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0787444Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0787679Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0787927Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0788223Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0788689Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0789040Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0789175Z } {tt.num_stages = 1 : i32} 2026-02-21T09:45:38.0789338Z %91 = arith.truncf %90 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:45:38.0789508Z %92 = arith.extsi %83 : i32 to i64 2026-02-21T09:45:38.0789633Z %93 = arith.extsi %80 : i32 to i64 2026-02-21T09:45:38.0789793Z %94 = tt.splat %92 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0790006Z %95 = arith.addi %94, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0790270Z %96 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0790511Z %97 = arith.muli %96, %cst_7 : tensor<32x1xi64, #mma> 
2026-02-21T09:45:38.0790690Z %98 = tt.broadcast %97 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0790890Z %99 = tt.splat %93 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0791101Z %100 = arith.addi %99, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0791366Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0791649Z %102 = tt.broadcast %101 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0791834Z %103 = arith.addi %98, %102 : tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0792023Z %104 = tt.addptr %18, %103 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0792254Z %105 = arith.cmpi sge, %96, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0792420Z %106 = arith.cmpi slt, %96, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0792583Z %107 = arith.andi %105, %106 : tensor<32x1xi1, #mma> 2026-02-21T09:45:38.0792761Z %108 = tt.broadcast %107 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0792951Z %109 = arith.cmpi sge, %101, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0793123Z %110 = arith.cmpi slt, %101, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0793285Z %111 = arith.andi %109, %110 : tensor<1x32xi1, #mma> 2026-02-21T09:45:38.0793464Z %112 = tt.broadcast %111 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0793643Z %113 = arith.andi %108, %112 : tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0793809Z tt.store %104, %91, %113 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:45:38.0793966Z %114 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:45:38.0794095Z %115 = arith.divsi %114, %c1024_i32 : i32 2026-02-21T09:45:38.0794225Z %116 = arith.muli %115, %c2_i32 : i32 2026-02-21T09:45:38.0794350Z %117 = arith.subi %c256_i32, %116 : i32 2026-02-21T09:45:38.0794476Z %118 = 
arith.minsi %117, %c2_i32 : i32 2026-02-21T09:45:38.0794601Z %119 = arith.remsi %114, %c1024_i32 : i32 2026-02-21T09:45:38.0794728Z %120 = arith.remsi %119, %118 : i32 2026-02-21T09:45:38.0794846Z %121 = arith.addi %116, %120 : i32 2026-02-21T09:45:38.0794967Z %122 = arith.divsi %119, %118 : i32 2026-02-21T09:45:38.0795108Z %123 = arith.muli %121, %c32_i32 : i32 2026-02-21T09:45:38.0795321Z %124 = tt.splat %123 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0795653Z %125 = arith.addi %124, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0795869Z %126 = arith.muli %122, %c32_i32 : i32 2026-02-21T09:45:38.0796045Z %127 = tt.splat %126 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0796276Z %128 = arith.addi %127, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0796560Z %129 = tt.expand_dims %128 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0796823Z %130 = arith.muli %129, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0797023Z %131 = tt.broadcast %130 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0797385Z %132 = tt.expand_dims %125 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0797785Z %133 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T09:45:38.0798006Z %200 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:45:38.0798186Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0798415Z %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T09:45:38.0798696Z %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0798979Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0799216Z %205 = arith.addi %131, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0799423Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0799631Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0799922Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0800333Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0800619Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:45:38.0800805Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0801038Z %212 = arith.addi %211, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0801361Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0801678Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0801916Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0802156Z %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0802397Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0802742Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0803108Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0803396Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0803640Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0803900Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0804139Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0804374Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0804601Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0804902Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0805373Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0805722Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:45:38.0805855Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:45:38.0806030Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0806265Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0806547Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0806829Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, 
#blocked1> 2026-02-21T09:45:38.0807030Z %234 = arith.addi %131, %233 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0807256Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0807469Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0807737Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0808162Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0808452Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:45:38.0808630Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0808870Z %241 = arith.addi %240, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0809189Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0809500Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0809739Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0809977Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0810219Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0810514Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0810855Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0811166Z %249 = 
tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0811405Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0811646Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0811898Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0812132Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0812358Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0812653Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0813122Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0813479Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0813608Z } {tt.num_stages = 1 : i32} 2026-02-21T09:45:38.0813773Z %134 = arith.truncf %133 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:45:38.0813948Z %135 = arith.extsi %126 : i32 to i64 2026-02-21T09:45:38.0814075Z %136 = arith.extsi %123 : i32 to i64 2026-02-21T09:45:38.0814242Z %137 = tt.splat %135 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0814457Z %138 = arith.addi %137, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0814727Z %139 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0814967Z %140 = arith.muli %139, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0815150Z %141 = tt.broadcast %140 : tensor<32x1xi64, #mma> -> 
tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0815372Z %142 = tt.splat %136 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0815585Z %143 = arith.addi %142, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0815850Z %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0816122Z %145 = tt.broadcast %144 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0816308Z %146 = arith.addi %141, %145 : tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0816495Z %147 = tt.addptr %18, %146 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0816700Z %148 = arith.cmpi sge, %139, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0816870Z %149 = arith.cmpi slt, %139, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0817030Z %150 = arith.andi %148, %149 : tensor<32x1xi1, #mma> 2026-02-21T09:45:38.0817210Z %151 = tt.broadcast %150 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0817401Z %152 = arith.cmpi sge, %144, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0817575Z %153 = arith.cmpi slt, %144, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0817737Z %154 = arith.andi %152, %153 : tensor<1x32xi1, #mma> 2026-02-21T09:45:38.0817913Z %155 = tt.broadcast %154 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0818094Z %156 = arith.andi %151, %155 : tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0818254Z tt.store %147, %134, %156 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:45:38.0818410Z %157 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:45:38.0818538Z %158 = arith.divsi %157, %c1024_i32 : i32 2026-02-21T09:45:38.0818669Z %159 = arith.muli %158, %c2_i32 : i32 2026-02-21T09:45:38.0818792Z %160 = arith.subi %c256_i32, %159 : i32 2026-02-21T09:45:38.0818934Z %161 = arith.minsi %160, %c2_i32 : i32 2026-02-21T09:45:38.0819063Z %162 = arith.remsi 
%157, %c1024_i32 : i32 2026-02-21T09:45:38.0819187Z %163 = arith.remsi %162, %161 : i32 2026-02-21T09:45:38.0819330Z %164 = arith.addi %159, %163 : i32 2026-02-21T09:45:38.0819447Z %165 = arith.divsi %162, %161 : i32 2026-02-21T09:45:38.0819570Z %166 = arith.muli %164, %c32_i32 : i32 2026-02-21T09:45:38.0819781Z %167 = tt.splat %166 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0820092Z %168 = arith.addi %167, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0820310Z %169 = arith.muli %165, %c32_i32 : i32 2026-02-21T09:45:38.0820481Z %170 = tt.splat %169 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0820709Z %171 = arith.addi %170, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0820993Z %172 = tt.expand_dims %171 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0821255Z %173 = arith.muli %172, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0821455Z %174 = tt.broadcast %173 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0821811Z %175 = tt.expand_dims %168 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0822209Z %176 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>) : i32 { 2026-02-21T09:45:38.0822424Z %200 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:45:38.0822603Z %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0822849Z %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0823127Z %203 = tt.expand_dims %202 {axis = 0 : i32} : 
tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0823412Z %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0823622Z %205 = arith.addi %174, %204 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0823829Z %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0824041Z %207 = tt.load %206 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0824309Z %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0824718Z %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0825012Z %210 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:45:38.0825198Z %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0825437Z %212 = arith.addi %211, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0825756Z %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0826073Z %214 = tt.load %213 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0826316Z %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0826552Z %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0826809Z %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0827098Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 
2026-02-21T09:45:38.0827451Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0827737Z %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0827980Z %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0828219Z %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0828448Z %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0828681Z %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0828908Z %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0829204Z %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0829674Z %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0830020Z %228 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:45:38.0830143Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:45:38.0830316Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0830539Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0830833Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0831110Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0831305Z %234 = arith.addi %174, %233 : 
tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0831524Z %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0831728Z %236 = tt.load %235 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0831997Z %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0832395Z %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0832677Z %239 = arith.muli %228, %c8192_i32 : i32 2026-02-21T09:45:38.0832859Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0833090Z %241 = arith.addi %240, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0833403Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0833714Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0833942Z %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0834183Z %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0834420Z %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0834737Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0835072Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0835373Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 
2026-02-21T09:45:38.0835612Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0835846Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0836080Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0836310Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0836531Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0836829Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0837291Z %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0837638Z scf.yield %256 : tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0837764Z } {tt.num_stages = 1 : i32} 2026-02-21T09:45:38.0837919Z %177 = arith.truncf %176 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:45:38.0838089Z %178 = arith.extsi %169 : i32 to i64 2026-02-21T09:45:38.0838209Z %179 = arith.extsi %166 : i32 to i64 2026-02-21T09:45:38.0838371Z %180 = tt.splat %178 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0838580Z %181 = arith.addi %180, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0838863Z %182 = tt.expand_dims %181 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0839106Z %183 = arith.muli %182, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0839280Z %184 = tt.broadcast %183 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0839506Z %185 = tt.splat %179 : i64 
-> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0839709Z %186 = arith.addi %185, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0839970Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0840226Z %188 = tt.broadcast %187 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0840403Z %189 = arith.addi %184, %188 : tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0840588Z %190 = tt.addptr %18, %189 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0840780Z %191 = arith.cmpi sge, %182, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0840948Z %192 = arith.cmpi slt, %182, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0841104Z %193 = arith.andi %191, %192 : tensor<32x1xi1, #mma> 2026-02-21T09:45:38.0841274Z %194 = tt.broadcast %193 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0841461Z %195 = arith.cmpi sge, %187, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0841624Z %196 = arith.cmpi slt, %187, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0841783Z %197 = arith.andi %195, %196 : tensor<1x32xi1, #mma> 2026-02-21T09:45:38.0841950Z %198 = tt.broadcast %197 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0842123Z %199 = arith.andi %194, %198 : tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0842296Z tt.store %190, %177, %199 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:45:38.0842445Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:45:38.0842604Z %25 = arith.subi %3, %24 : i32 2026-02-21T09:45:38.0842742Z %26 = arith.muli %25, %c256_i32 : i32 2026-02-21T09:45:38.0842861Z %27 = arith.subi %24, %c1_i32 : i32 2026-02-21T09:45:38.0843332Z %28:8 = scf.for %arg3 = %c0_i32 to %26 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %27, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = 
%cst_1) -> (i32, i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>) : i32 { 2026-02-21T09:45:38.0843801Z %29 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:45:38.0843929Z %30 = arith.cmpi eq, %arg4, %c255_i32 : i32 2026-02-21T09:45:38.0844059Z %31 = arith.select %30, %c0_i32, %29 : i32 2026-02-21T09:45:38.0844185Z %32 = arith.cmpi eq, %31, %c0_i32 : i32 2026-02-21T09:45:38.0844309Z %33 = arith.select %32, %c0_i32, %arg6 : i32 2026-02-21T09:45:38.0844541Z %34:5 = scf.if %32 -> (i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) { 2026-02-21T09:45:38.0844771Z %95 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:45:38.0844894Z %96 = arith.divsi %95, %c1024_i32 : i32 2026-02-21T09:45:38.0845017Z %97 = arith.muli %96, %c2_i32 : i32 2026-02-21T09:45:38.0845134Z %98 = arith.subi %c256_i32, %97 : i32 2026-02-21T09:45:38.0845253Z %99 = arith.minsi %98, %c2_i32 : i32 2026-02-21T09:45:38.0845370Z %100 = arith.remsi %95, %c1024_i32 : i32 2026-02-21T09:45:38.0845491Z %101 = arith.remsi %100, %99 : i32 2026-02-21T09:45:38.0845608Z %102 = arith.addi %97, %101 : i32 2026-02-21T09:45:38.0845721Z %103 = arith.divsi %100, %99 : i32 2026-02-21T09:45:38.0845839Z %104 = arith.muli %102, %c32_i32 : i32 2026-02-21T09:45:38.0846071Z %105 = tt.splat %104 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0846373Z %106 = arith.addi %105, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:38.0846586Z %107 = arith.muli %103, %c32_i32 : i32 2026-02-21T09:45:38.0846774Z %108 = tt.splat %107 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0847002Z %109 = arith.addi %108, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:38.0847279Z %110 = tt.expand_dims %109 {axis = 1 : 
i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0847535Z %111 = arith.muli %110, %cst_3 : tensor<32x1xi32, #blocked1> 2026-02-21T09:45:38.0847734Z %112 = tt.broadcast %111 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0848096Z %113 = tt.expand_dims %106 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0848521Z scf.yield %104, %107, %112, %113, %95 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 2026-02-21T09:45:38.0848755Z } else { 2026-02-21T09:45:38.0848981Z scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 2026-02-21T09:45:38.0849228Z } 2026-02-21T09:45:38.0849313Z %35 = arith.muli %33, %c2_i32 : i32 2026-02-21T09:45:38.0849477Z %36 = tt.splat %35 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0849691Z %37 = arith.addi %36, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0849981Z %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0850249Z %39 = tt.broadcast %38 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0850455Z %40 = arith.addi %34#2, %39 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0850651Z %41 = tt.addptr %9, %40 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0850850Z %42 = tt.load %41 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0851110Z %43 = ttg.convert_layout %42 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0851502Z %44 = 
arith.extf %43 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0851775Z %45 = arith.muli %33, %c8192_i32 : i32 2026-02-21T09:45:38.0851946Z %46 = tt.splat %45 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0852164Z %47 = arith.addi %46, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0852467Z %48 = tt.addptr %10, %47 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0852771Z %49 = tt.load %48 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0852997Z %50 = arith.shli %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0853227Z %51 = arith.shrsi %50, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0853455Z %52 = arith.shrsi %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0853735Z %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0854076Z %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0854352Z %55 = tt.broadcast %53 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0854583Z %56 = arith.select %15, %55, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0854825Z %57 = tt.broadcast %54 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0855046Z %58 = arith.select %17, %57, %56 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0855263Z %59 = tt.reshape %58 : tensor<1x2x32xi8, #blocked> -> 
tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0855476Z %60 = arith.sitofp %59 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0855762Z %61 = ttg.convert_layout %60 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0856240Z %62 = tt.dot %44, %61, %arg7, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0856582Z %63 = arith.addi %33, %c1_i32 : i32 2026-02-21T09:45:38.0856702Z %64 = arith.muli %63, %c2_i32 : i32 2026-02-21T09:45:38.0856866Z %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0857079Z %66 = arith.addi %65, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:38.0857348Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:38.0857615Z %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0857817Z %69 = arith.addi %34#2, %68 : tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0858012Z %70 = tt.addptr %9, %69 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:45:38.0858224Z %71 = tt.load %70 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:45:38.0858480Z %72 = ttg.convert_layout %71 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0858872Z %73 = arith.extf %72 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0859146Z %74 = arith.muli %63, %c8192_i32 : i32 2026-02-21T09:45:38.0859313Z %75 = tt.splat %74 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:45:38.0859536Z %76 = arith.addi %75, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0859841Z %77 = tt.addptr %10, %76 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0860145Z %78 = tt.load %77 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0860372Z %79 = arith.shli %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0860598Z %80 = arith.shrsi %79, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0860826Z %81 = arith.shrsi %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0861103Z %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0861429Z %83 = tt.expand_dims %81 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:38.0861703Z %84 = tt.broadcast %82 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0861945Z %85 = arith.select %15, %84, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0862175Z %86 = tt.broadcast %83 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0862396Z %87 = arith.select %17, %86, %85 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:38.0862626Z %88 = tt.reshape %87 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:38.0862838Z %89 = arith.sitofp %88 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:38.0863122Z %90 = ttg.convert_layout %89 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:38.0863572Z %91 = tt.dot 
%73, %90, %62, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0863904Z %92 = arith.addi %33, %c2_i32 : i32 2026-02-21T09:45:38.0864029Z %93 = arith.cmpi eq, %31, %c255_i32 : i32 2026-02-21T09:45:38.0864175Z %94 = arith.select %93, %cst, %91 : tensor<32x32xf32, #mma> 2026-02-21T09:45:38.0864310Z scf.if %93 { 2026-02-21T09:45:38.0864447Z %95 = arith.truncf %91 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma> 2026-02-21T09:45:38.0864610Z %96 = arith.extsi %34#1 : i32 to i64 2026-02-21T09:45:38.0864731Z %97 = arith.extsi %34#0 : i32 to i64 2026-02-21T09:45:38.0864888Z %98 = tt.splat %96 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0865089Z %99 = arith.addi %98, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:38.0865350Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0865598Z %101 = arith.muli %100, %cst_7 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0865773Z %102 = tt.broadcast %101 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0865993Z %103 = tt.splat %97 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0866199Z %104 = arith.addi %103, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:38.0866464Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0866715Z %106 = tt.broadcast %105 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0866900Z %107 = arith.addi %102, %106 : tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0867088Z %108 = tt.addptr %18, %107 : tensor<32x32x!tt.ptr, #mma>, tensor<32x32xi64, #mma> 2026-02-21T09:45:38.0867283Z %109 = 
arith.cmpi sge, %100, %cst_8 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0867450Z %110 = arith.cmpi slt, %100, %cst_9 : tensor<32x1xi64, #mma> 2026-02-21T09:45:38.0867605Z %111 = arith.andi %109, %110 : tensor<32x1xi1, #mma> 2026-02-21T09:45:38.0867780Z %112 = tt.broadcast %111 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0867965Z %113 = arith.cmpi sge, %105, %cst_10 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0868135Z %114 = arith.cmpi slt, %105, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:45:38.0868295Z %115 = arith.andi %113, %114 : tensor<1x32xi1, #mma> 2026-02-21T09:45:38.0868465Z %116 = tt.broadcast %115 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0868643Z %117 = arith.andi %112, %116 : tensor<32x32xi1, #mma> 2026-02-21T09:45:38.0868800Z tt.store %108, %95, %117 : tensor<32x32x!tt.ptr, #mma> 2026-02-21T09:45:38.0868934Z } 2026-02-21T09:45:38.0869215Z scf.yield %31, %34#4, %92, %94, %34#0, %34#1, %34#2, %34#3 : i32, i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:38.0869495Z } 2026-02-21T09:45:38.0869572Z tt.return 2026-02-21T09:45:38.0869650Z } 2026-02-21T09:45:38.0869724Z } 2026-02-21T09:45:38.0869767Z 2026-02-21T09:45:38.0869812Z {-# 2026-02-21T09:45:38.0869894Z external_resources: { 2026-02-21T09:45:38.0869991Z mlir_reproducer: { 2026-02-21T09:45:38.0870994Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, 
symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:45:38.0871980Z disable_threading: false, 2026-02-21T09:45:38.0872087Z verify_each: true 2026-02-21T09:45:38.0872181Z } 2026-02-21T09:45:38.0872252Z } 2026-02-21T09:45:38.0872326Z #-} 2026-02-21T09:45:38.0872607Z /tmp/torchinductor_root/4g/c4gqla7w3u5bg7mh47a4i7i3ad5wu42gcpqh6igdtxb2rmf43rrj.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:45:38.0873280Z /tmp/torchinductor_root/4g/c4gqla7w3u5bg7mh47a4i7i3ad5wu42gcpqh6igdtxb2rmf43rrj.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:45:38.0873825Z [68s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:45:38.0874608Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 32], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[4, 2], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:45:38.0875317Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:45:38.0875487Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:45:39.4818015Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:45:39.4822025Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:45:39.4822915Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:45:39.4823624Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:45:39.4824302Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:45:39.4824891Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:45:39.4825318Z #smem = #ttg.shared_memory 2026-02-21T09:45:39.4825855Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:45:39.4826942Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:45:39.4827865Z %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32, #mma> 2026-02-21T09:45:39.4828490Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:45:39.4828757Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:45:39.4829029Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:45:39.4829290Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:45:39.4829551Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:45:39.4829961Z %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4830316Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:45:39.4830566Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:45:39.4830822Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:45:39.4831082Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:45:39.4831339Z %c32_i32 = arith.constant 32 : i32 
2026-02-21T09:45:39.4831592Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:45:39.4831850Z %c512_i64 = arith.constant 512 : i64 2026-02-21T09:45:39.4832102Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T09:45:39.4832410Z %cst_1 = arith.constant dense<0> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4832838Z %cst_2 = arith.constant dense<8192> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4833255Z %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4833614Z %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:45:39.4833964Z %cst_5 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4834313Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:39.4834591Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:39.4834869Z %cst_8 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:45:39.4835104Z %0 = tt.get_program_id x : i32 2026-02-21T09:45:39.4835286Z %1 = arith.divsi %0, %c2048_i32 : i32 2026-02-21T09:45:39.4835513Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T09:45:39.4835697Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T09:45:39.4835885Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T09:45:39.4836115Z %5 = arith.remsi %0, %c2048_i32 : i32 2026-02-21T09:45:39.4836303Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:45:39.4836484Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:45:39.4836657Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:45:39.4836828Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:45:39.4837220Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:39.4837732Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = 
#mma}>> 2026-02-21T09:45:39.4838118Z %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:39.4838457Z %13 = arith.addi %12, %11 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:45:39.4838723Z %14 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:45:39.4839046Z %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:39.4839508Z %16 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:39.4839913Z %17 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:39.4840257Z %18 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:39.4840601Z %19 = arith.addi %17, %15 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:45:39.4840935Z %20 = arith.addi %18, %16 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:45:39.4841330Z %21 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4841853Z %22 = tt.expand_dims %19 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:45:39.4842202Z %23 = arith.muli %22, %cst_4 : tensor<128x1xi32, #blocked1> 2026-02-21T09:45:39.4842442Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4842789Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:45:39.4842990Z %26 = arith.extsi %9 : i32 to i64 2026-02-21T09:45:39.4843223Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4843574Z %28 = tt.splat %26 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 
2026-02-21T09:45:39.4844088Z %29 = arith.extsi %10 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:39.4844583Z %30 = arith.addi %28, %29 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:45:39.4845058Z %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4845504Z %32 = arith.cmpi sge, %31, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4845799Z %33 = arith.cmpi slt, %31, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4846080Z %34 = arith.andi %32, %33 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4846428Z %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:45:39.4846958Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:45:39.4847470Z %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:39.4847778Z %38 = arith.cmpi eq, %37, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:39.4848014Z %39 = tt.broadcast %38 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:45:39.4848254Z %40 = arith.cmpi eq, %37, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:45:39.4848485Z %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:45:39.4848805Z %42 = scf.for %arg3 = %c0_i32 to 
%c512_i32 step %c4_i32 iter_args(%arg4 = %cst) -> (tensor<128x32xf32, #mma>) : i32 { 2026-02-21T09:45:39.4849067Z %52 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:45:39.4849275Z %53 = tt.splat %52 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4849537Z %54 = arith.addi %53, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4849871Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:39.4850209Z %56 = tt.broadcast %55 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4850444Z %57 = arith.addi %24, %56 : tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4850688Z %58 = tt.addptr %25, %57 : tensor<128x2x!tt.ptr, #blocked1>, tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4850935Z %59 = tt.load %58 : tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:45:39.4851206Z %60 = ttg.local_alloc %59 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem> 2026-02-21T09:45:39.4851640Z %61 = ttg.local_load %60 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4852143Z %62 = arith.extf %61 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4852506Z %63 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:45:39.4852630Z %64 = arith.muli %63, %c8192_i64 : i64 2026-02-21T09:45:39.4852839Z %65 = tt.splat %64 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4853057Z %66 = arith.addi %65, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4853359Z %67 = tt.addptr %27, %66 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:45:39.4853675Z %68 = tt.load %67, %34, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4853913Z %69 = arith.shli %68, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4854144Z %70 = arith.shrsi %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4854373Z %71 = arith.shrsi %68, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4854657Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4854982Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4855254Z %74 = tt.broadcast %72 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4855484Z %75 = arith.select %39, %74, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4855726Z %76 = tt.broadcast %73 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4855951Z %77 = arith.select %41, %76, %75 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4856189Z %78 = tt.reshape %77 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:39.4856405Z %79 = arith.sitofp %78 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:39.4856691Z %80 = ttg.convert_layout %79 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4857150Z %81 = tt.dot %62, %80, %arg4, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:45:39.4857496Z %82 = arith.addi %arg3, %c1_i32 : i32 
2026-02-21T09:45:39.4857614Z %83 = arith.muli %82, %c2_i32 : i32 2026-02-21T09:45:39.4857778Z %84 = tt.splat %83 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4857994Z %85 = arith.addi %84, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4858263Z %86 = tt.expand_dims %85 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:39.4858536Z %87 = tt.broadcast %86 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4858724Z %88 = arith.addi %24, %87 : tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4858919Z %89 = tt.addptr %25, %88 : tensor<128x2x!tt.ptr, #blocked1>, tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4859121Z %90 = tt.load %89 : tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:45:39.4859338Z %91 = ttg.local_alloc %90 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem> 2026-02-21T09:45:39.4859691Z %92 = ttg.local_load %91 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4860094Z %93 = arith.extf %92 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4860386Z %94 = arith.extsi %82 : i32 to i64 2026-02-21T09:45:39.4860504Z %95 = arith.muli %94, %c8192_i64 : i64 2026-02-21T09:45:39.4860670Z %96 = tt.splat %95 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4860892Z %97 = arith.addi %96, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4861194Z %98 = tt.addptr %27, %97 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4861453Z %99 = arith.cmpi slt, %94, %c512_i64 : i64 2026-02-21T09:45:39.4861628Z 
%100 = tt.splat %99 : i1 -> tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4861850Z %101 = arith.andi %100, %34 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4862088Z %102 = tt.load %98, %101, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4862328Z %103 = arith.shli %102, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4862568Z %104 = arith.shrsi %103, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4862802Z %105 = arith.shrsi %102, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4863085Z %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4863440Z %107 = tt.expand_dims %105 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4863719Z %108 = tt.broadcast %106 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4863971Z %109 = arith.select %39, %108, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4864202Z %110 = tt.broadcast %107 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4864429Z %111 = arith.select %41, %110, %109 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4864655Z %112 = tt.reshape %111 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:39.4864871Z %113 = arith.sitofp %112 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:39.4865171Z %114 = ttg.convert_layout %113 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4865633Z %115 = tt.dot %93, %114, %81, inputPrecision = tf32 : tensor<128x2xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:45:39.4865972Z %116 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:45:39.4866093Z %117 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:45:39.4866260Z %118 = tt.splat %117 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4866480Z %119 = arith.addi %118, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4866755Z %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:39.4867028Z %121 = tt.broadcast %120 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4867222Z %122 = arith.addi %24, %121 : tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4867439Z %123 = tt.addptr %25, %122 : tensor<128x2x!tt.ptr, #blocked1>, tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4867650Z %124 = tt.load %123 : tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:45:39.4867883Z %125 = ttg.local_alloc %124 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem> 2026-02-21T09:45:39.4868217Z %126 = ttg.local_load %125 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4868646Z %127 = arith.extf %126 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4868922Z %128 = arith.extsi %116 : i32 to i64 2026-02-21T09:45:39.4869045Z %129 = arith.muli %128, %c8192_i64 : i64 2026-02-21T09:45:39.4869217Z %130 = tt.splat %129 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4869442Z %131 = arith.addi %130, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4869754Z 
%132 = tt.addptr %27, %131 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4870013Z %133 = arith.cmpi slt, %128, %c512_i64 : i64 2026-02-21T09:45:39.4870187Z %134 = tt.splat %133 : i1 -> tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4870422Z %135 = arith.andi %134, %34 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4870662Z %136 = tt.load %132, %135, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4870912Z %137 = arith.shli %136, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4871149Z %138 = arith.shrsi %137, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4871413Z %139 = arith.shrsi %136, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4871705Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4872053Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4872339Z %142 = tt.broadcast %140 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4872581Z %143 = arith.select %39, %142, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4872817Z %144 = tt.broadcast %141 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4873055Z %145 = arith.select %41, %144, %143 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4873283Z %146 = tt.reshape %145 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:39.4873514Z %147 = arith.sitofp %146 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 
2026-02-21T09:45:39.4873811Z %148 = ttg.convert_layout %147 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4874273Z %149 = tt.dot %127, %148, %115, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:45:39.4874625Z %150 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:45:39.4874753Z %151 = arith.muli %150, %c2_i32 : i32 2026-02-21T09:45:39.4874922Z %152 = tt.splat %151 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4875148Z %153 = arith.addi %152, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:45:39.4875445Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:45:39.4875727Z %155 = tt.broadcast %154 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4875926Z %156 = arith.addi %24, %155 : tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4876147Z %157 = tt.addptr %25, %156 : tensor<128x2x!tt.ptr, #blocked1>, tensor<128x2xi32, #blocked1> 2026-02-21T09:45:39.4876360Z %158 = tt.load %157 : tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:45:39.4876584Z %159 = ttg.local_alloc %158 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem> 2026-02-21T09:45:39.4876921Z %160 = ttg.local_load %159 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4877331Z %161 = arith.extf %160 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4877618Z %162 = arith.extsi %150 : i32 to i64 2026-02-21T09:45:39.4877747Z %163 = arith.muli %162, %c8192_i64 : 
i64 2026-02-21T09:45:39.4877926Z %164 = tt.splat %163 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4878163Z %165 = arith.addi %164, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4878475Z %166 = tt.addptr %27, %165 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4878749Z %167 = arith.cmpi slt, %162, %c512_i64 : i64 2026-02-21T09:45:39.4878931Z %168 = tt.splat %167 : i1 -> tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4879157Z %169 = arith.andi %168, %34 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4879423Z %170 = tt.load %166, %169, %cst_1 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4879667Z %171 = arith.shli %170, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4879924Z %172 = arith.shrsi %171, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4880165Z %173 = arith.shrsi %170, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:45:39.4880451Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4880786Z %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:45:39.4881065Z %176 = tt.broadcast %174 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4881306Z %177 = arith.select %39, %176, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4881547Z %178 = tt.broadcast %175 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4881779Z %179 = arith.select %41, %178, %177 : tensor<1x2x32xi1, #blocked>, 
tensor<1x2x32xi8, #blocked> 2026-02-21T09:45:39.4882012Z %180 = tt.reshape %179 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:45:39.4882234Z %181 = arith.sitofp %180 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:45:39.4882537Z %182 = ttg.convert_layout %181 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:45:39.4883046Z %183 = tt.dot %161, %182, %149, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:45:39.4883396Z scf.yield %183 : tensor<128x32xf32, #mma> 2026-02-21T09:45:39.4883585Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:45:39.4883775Z %43 = arith.truncf %42 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma> 2026-02-21T09:45:39.4884040Z %44 = tt.expand_dims %20 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:45:39.4884299Z %45 = arith.muli %44, %cst_8 : tensor<128x1xi32, #mma> 2026-02-21T09:45:39.4884523Z %46 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T09:45:39.4884779Z %47 = tt.broadcast %45 : tensor<128x1xi32, #mma> -> tensor<128x32xi32, #mma> 2026-02-21T09:45:39.4884978Z %48 = tt.broadcast %46 : tensor<1x32xi32, #mma> -> tensor<128x32xi32, #mma> 2026-02-21T09:45:39.4885157Z %49 = arith.addi %47, %48 : tensor<128x32xi32, #mma> 2026-02-21T09:45:39.4885333Z %50 = tt.splat %arg2 : !tt.ptr -> tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:45:39.4885548Z %51 = tt.addptr %50, %49 : tensor<128x32x!tt.ptr, #mma>, tensor<128x32xi32, #mma> 2026-02-21T09:45:39.4885743Z tt.store %51, %43 : tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:45:39.4885875Z tt.return 2026-02-21T09:45:39.4885962Z } 
2026-02-21T09:45:39.4886037Z } 2026-02-21T09:45:39.4886086Z 2026-02-21T09:45:39.4886120Z {-# 2026-02-21T09:45:39.4886208Z external_resources: { 2026-02-21T09:45:39.4886309Z mlir_reproducer: { 2026-02-21T09:45:39.4887317Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:45:39.4888310Z disable_threading: false, 2026-02-21T09:45:39.4888438Z verify_each: true 2026-02-21T09:45:39.4888534Z } 2026-02-21T09:45:39.4888608Z } 2026-02-21T09:45:39.4888684Z #-} 2026-02-21T09:45:39.4888967Z /tmp/torchinductor_root/ur/curprybqiqqa3re3xtfhh4eybc7zmkddnatmkgwbjlfqorv2budm.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:45:39.4889652Z /tmp/torchinductor_root/ur/curprybqiqqa3re3xtfhh4eybc7zmkddnatmkgwbjlfqorv2budm.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:45:39.4890200Z [70s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:45:39.4890926Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:45:39.4891596Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:45:39.4891771Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:46:17.8999207Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:46:17.9000797Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:46:17.9002049Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:46:17.9002969Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:46:17.9003747Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:46:17.9004639Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:46:17.9005664Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:46:17.9006524Z %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma> 2026-02-21T09:46:17.9006875Z %c2_i32 = 
arith.constant 2 : i32 2026-02-21T09:46:17.9007140Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T09:46:17.9007423Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:46:17.9007675Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:46:17.9007994Z %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked> 2026-02-21T09:46:17.9008327Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T09:46:17.9008577Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:46:17.9008826Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:46:17.9009070Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:46:17.9009317Z %c255_i32 = arith.constant 255 : i32 2026-02-21T09:46:17.9009561Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:46:17.9009966Z %cst_1 = arith.constant dense<0> : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9010453Z %cst_2 = arith.constant dense<0> : tensor<128x4xi32, #blocked1> 2026-02-21T09:46:17.9010783Z %c-1_i32 = arith.constant -1 : i32 2026-02-21T09:46:17.9011029Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:46:17.9011425Z %cst_3 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:46:17.9011902Z %cst_4 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9012557Z %cst_5 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9013022Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:46:17.9013390Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:46:17.9013755Z %cst_8 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:46:17.9014117Z %cst_9 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:46:17.9014477Z %cst_10 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:46:17.9014782Z %cst_11 = arith.constant dense<0> : tensor<1x16xi64, #mma> 2026-02-21T09:46:17.9015028Z %cst_12 = arith.constant 
dense<8192> : tensor<1x16xi64, #mma> 2026-02-21T09:46:17.9015250Z %0 = tt.get_program_id x : i32 2026-02-21T09:46:17.9015551Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:17.9015981Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:17.9016455Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:46:17.9016918Z %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:46:17.9017378Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:46:17.9017839Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:46:17.9018200Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:46:17.9018610Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9019072Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:46:17.9019728Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:46:17.9020341Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:46:17.9020719Z %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:46:17.9021016Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 
2026-02-21T09:46:17.9021318Z %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:46:17.9021593Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked> 2026-02-21T09:46:17.9021909Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<128x16x!tt.ptr, #mma> 2026-02-21T09:46:17.9022318Z %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:17.9022820Z %18 = arith.extsi %4 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:46:17.9023170Z %19 = arith.subi %c65536_i32, %0 : i32 2026-02-21T09:46:17.9023364Z %20 = arith.ceildivsi %19, %c38912_i32 : i32 2026-02-21T09:46:17.9023553Z %21 = arith.muli %20, %c256_i32 : i32 2026-02-21T09:46:17.9023732Z %22 = arith.subi %0, %c38912_i32 : i32 2026-02-21T09:46:17.9024500Z %23:8 = scf.for %arg3 = %c0_i32 to %21 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %22, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = %cst_1) -> (i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>) : i32 { 2026-02-21T09:46:17.9025168Z %24 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:46:17.9025317Z %25 = arith.cmpi eq, %arg4, %c255_i32 : i32 2026-02-21T09:46:17.9025468Z %26 = arith.select %25, %c0_i32, %24 : i32 2026-02-21T09:46:17.9025613Z %27 = arith.cmpi eq, %26, %c0_i32 : i32 2026-02-21T09:46:17.9025758Z %28 = arith.select %27, %c0_i32, %arg6 : i32 2026-02-21T09:46:17.9026034Z %29:5 = scf.if %27 -> (i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) { 2026-02-21T09:46:17.9026302Z %64 = arith.addi %arg5, %c38912_i32 : i32 2026-02-21T09:46:17.9026452Z %65 = arith.divsi %64, %c2048_i32 : i32 2026-02-21T09:46:17.9026595Z %66 = arith.muli %65, %c4_i32 : i32 
2026-02-21T09:46:17.9026740Z %67 = arith.subi %c128_i32, %66 : i32 2026-02-21T09:46:17.9026877Z %68 = arith.minsi %67, %c4_i32 : i32 2026-02-21T09:46:17.9027022Z %69 = arith.remsi %64, %c2048_i32 : i32 2026-02-21T09:46:17.9027165Z %70 = arith.remsi %69, %68 : i32 2026-02-21T09:46:17.9027297Z %71 = arith.addi %66, %70 : i32 2026-02-21T09:46:17.9027432Z %72 = arith.divsi %69, %68 : i32 2026-02-21T09:46:17.9027568Z %73 = arith.muli %71, %c128_i32 : i32 2026-02-21T09:46:17.9027772Z %74 = tt.splat %73 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:17.9028039Z %75 = arith.addi %74, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:17.9028243Z %76 = arith.muli %72, %c16_i32 : i32 2026-02-21T09:46:17.9028489Z %77 = tt.splat %76 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:46:17.9028855Z %78 = arith.addi %77, %3 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:46:17.9029227Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:46:17.9029528Z %80 = arith.muli %79, %cst_3 : tensor<128x1xi32, #blocked1> 2026-02-21T09:46:17.9029775Z %81 = tt.broadcast %80 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:46:17.9030194Z %82 = tt.expand_dims %78 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9030698Z %83 = tt.broadcast %82 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9031129Z scf.yield %73, %76, %81, %83, %64 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 
2026-02-21T09:46:17.9031405Z } else { 2026-02-21T09:46:17.9031671Z scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32 2026-02-21T09:46:17.9031969Z } 2026-02-21T09:46:17.9032169Z %30 = tt.splat %28 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:46:17.9032547Z %31 = arith.addi %30, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:46:17.9032791Z %32 = arith.muli %28, %c2_i32 : i32 2026-02-21T09:46:17.9032978Z %33 = tt.splat %32 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:46:17.9033230Z %34 = arith.addi %33, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:46:17.9033563Z %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:46:17.9033883Z %36 = tt.broadcast %35 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:46:17.9034133Z %37 = arith.addi %29#2, %36 : tensor<128x4xi32, #blocked1> 2026-02-21T09:46:17.9034332Z %38 = tt.addptr %7, %37 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:46:17.9034536Z %39 = tt.load %38 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:46:17.9034799Z %40 = ttg.convert_layout %39 : tensor<128x4xbf16, #blocked1> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:17.9035198Z %41 = arith.extf %40 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:17.9035652Z %42 = tt.expand_dims %31 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:46:17.9042543Z %43 = arith.muli %42, %cst_4 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9042906Z %44 = tt.broadcast %43 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9043205Z %45 = arith.addi %44, %29#3 : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9043510Z %46 = tt.addptr %8, %45 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9043809Z %47 = tt.load %46 : tensor<2x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9044035Z %48 = arith.shli %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9044318Z %49 = arith.shrsi %48, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9044547Z %50 = arith.shrsi %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9044828Z %51 = tt.expand_dims %49 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:46:17.9045175Z %52 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked> 2026-02-21T09:46:17.9045449Z %53 = tt.broadcast %51 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:46:17.9045677Z %54 = arith.select %13, %53, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:46:17.9045907Z %55 = tt.broadcast %52 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked> 2026-02-21T09:46:17.9046134Z %56 = arith.select %15, %55, %54 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked> 2026-02-21T09:46:17.9046355Z %57 = tt.reshape %56 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2> 2026-02-21T09:46:17.9046569Z %58 = 
arith.sitofp %57 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2> 2026-02-21T09:46:17.9046853Z %59 = ttg.convert_layout %58 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:17.9047314Z %60 = tt.dot %41, %59, %arg7, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma> 2026-02-21T09:46:17.9047655Z %61 = arith.addi %28, %c2_i32 : i32 2026-02-21T09:46:17.9047780Z %62 = arith.cmpi eq, %26, %c255_i32 : i32 2026-02-21T09:46:17.9047928Z %63 = arith.select %62, %cst, %60 : tensor<128x16xf32, #mma> 2026-02-21T09:46:17.9048064Z scf.if %62 { 2026-02-21T09:46:17.9048220Z %64 = arith.truncf %60 : tensor<128x16xf32, #mma> to tensor<128x16xbf16, #mma> 2026-02-21T09:46:17.9048388Z %65 = arith.extsi %29#0 : i32 to i64 2026-02-21T09:46:17.9048507Z %66 = arith.extsi %29#1 : i32 to i64 2026-02-21T09:46:17.9048688Z %67 = tt.splat %65 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:17.9048894Z %68 = arith.addi %67, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:17.9049158Z %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:46:17.9049392Z %70 = arith.muli %69, %cst_8 : tensor<128x1xi64, #mma> 2026-02-21T09:46:17.9049568Z %71 = tt.broadcast %70 : tensor<128x1xi64, #mma> -> tensor<128x16xi64, #mma> 2026-02-21T09:46:17.9049770Z %72 = tt.splat %66 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:46:17.9049971Z %73 = arith.addi %72, %18 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:46:17.9050228Z %74 = tt.expand_dims %73 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma> 2026-02-21T09:46:17.9050482Z %75 = tt.broadcast %74 : 
tensor<1x16xi64, #mma> -> tensor<128x16xi64, #mma> 2026-02-21T09:46:17.9050660Z %76 = arith.addi %71, %75 : tensor<128x16xi64, #mma> 2026-02-21T09:46:17.9050844Z %77 = tt.addptr %16, %76 : tensor<128x16x!tt.ptr, #mma>, tensor<128x16xi64, #mma> 2026-02-21T09:46:17.9051041Z %78 = arith.cmpi sge, %69, %cst_9 : tensor<128x1xi64, #mma> 2026-02-21T09:46:17.9051206Z %79 = arith.cmpi slt, %69, %cst_10 : tensor<128x1xi64, #mma> 2026-02-21T09:46:17.9051361Z %80 = arith.andi %78, %79 : tensor<128x1xi1, #mma> 2026-02-21T09:46:17.9051533Z %81 = tt.broadcast %80 : tensor<128x1xi1, #mma> -> tensor<128x16xi1, #mma> 2026-02-21T09:46:17.9051712Z %82 = arith.cmpi sge, %74, %cst_11 : tensor<1x16xi64, #mma> 2026-02-21T09:46:17.9051896Z %83 = arith.cmpi slt, %74, %cst_12 : tensor<1x16xi64, #mma> 2026-02-21T09:46:17.9052049Z %84 = arith.andi %82, %83 : tensor<1x16xi1, #mma> 2026-02-21T09:46:17.9052213Z %85 = tt.broadcast %84 : tensor<1x16xi1, #mma> -> tensor<128x16xi1, #mma> 2026-02-21T09:46:17.9052385Z %86 = arith.andi %81, %85 : tensor<128x16xi1, #mma> 2026-02-21T09:46:17.9052553Z tt.store %77, %64, %86 : tensor<128x16x!tt.ptr, #mma> 2026-02-21T09:46:17.9052685Z } 2026-02-21T09:46:17.9052947Z scf.yield %26, %29#4, %61, %63, %29#0, %29#1, %29#2, %29#3 : i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:17.9053264Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32} 2026-02-21T09:46:17.9053399Z tt.return 2026-02-21T09:46:17.9053478Z } 2026-02-21T09:46:17.9053559Z } 2026-02-21T09:46:17.9053603Z 2026-02-21T09:46:17.9053633Z {-# 2026-02-21T09:46:17.9053717Z external_resources: { 2026-02-21T09:46:17.9053815Z mlir_reproducer: { 2026-02-21T09:46:17.9054806Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:46:17.9055787Z disable_threading: false, 2026-02-21T09:46:17.9055894Z verify_each: true 2026-02-21T09:46:17.9055983Z } 2026-02-21T09:46:17.9056059Z } 2026-02-21T09:46:17.9056126Z #-} 2026-02-21T09:46:17.9056421Z /tmp/torchinductor_root/2i/c2i3iz5awefnwj6vhr5ioojv2job23trwyh547leswpj2jravbs2.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:46:17.9057093Z /tmp/torchinductor_root/2i/c2i3iz5awefnwj6vhr5ioojv2job23trwyh547leswpj2jravbs2.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:46:17.9057657Z [108s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:46:17.9058442Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:46:17.9059158Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:46:17.9059327Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:46:21.2035688Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:46:21.2037249Z #blocked = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:46:21.2038212Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:46:21.2039077Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:46:21.2039899Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:46:21.2041184Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:46:21.2041881Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:46:21.2042401Z #smem = #ttg.shared_memory 2026-02-21T09:46:21.2043270Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:46:21.2044566Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:46:21.2045745Z %cst = arith.constant dense<16384> : tensor<256x1xi64, #mma> 2026-02-21T09:46:21.2045966Z %cst_0 = arith.constant dense<0> : tensor<256x1xi64, #mma> 2026-02-21T09:46:21.2046179Z %cst_1 = arith.constant dense<8192> : tensor<256x1xi64, #mma> 2026-02-21T09:46:21.2046392Z %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #mma> 2026-02-21T09:46:21.2046597Z %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #mma> 
2026-02-21T09:46:21.2046807Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2047014Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2047227Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2047443Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:46:21.2047659Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:46:21.2047881Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<256x64xf32, #mma> 2026-02-21T09:46:21.2048083Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:46:21.2048231Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:46:21.2048460Z %cst_10 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:21.2048800Z %cst_11 = arith.constant dense<0> : tensor<1x64xi64, #blocked> 2026-02-21T09:46:21.2049018Z %cst_12 = arith.constant dense<8192> : tensor<1x64xi64, #blocked> 2026-02-21T09:46:21.2049324Z %cst_13 = arith.constant dense<1024> : tensor<256x1xi32, #blocked2> 2026-02-21T09:46:21.2049510Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:46:21.2049657Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:46:21.2049804Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:46:21.2049944Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:46:21.2050121Z %cst_14 = arith.constant dense<0> : tensor<2x64xi8, #blocked> 2026-02-21T09:46:21.2050333Z %cst_15 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2050522Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:46:21.2050666Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:46:21.2050892Z %cst_16 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2051132Z %0 = tt.get_program_id x : i32 2026-02-21T09:46:21.2051274Z %1 = arith.divsi %0, %c4096_i32 : i32 2026-02-21T09:46:21.2051420Z %2 = arith.muli %1, %c64_i32 : i32 
2026-02-21T09:46:21.2051561Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:46:21.2051704Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T09:46:21.2051844Z %5 = arith.remsi %0, %c4096_i32 : i32 2026-02-21T09:46:21.2051988Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:46:21.2052121Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:46:21.2052254Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:46:21.2052386Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:46:21.2052529Z %10 = arith.muli %8, %c256_i32 : i32 2026-02-21T09:46:21.2052786Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:46:21.2053143Z %12 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:21.2053488Z %13 = tt.splat %10 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:46:21.2053765Z %14 = arith.addi %13, %11 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:46:21.2054082Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:46:21.2054518Z %16 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<256x1xi32, #blocked2> 2026-02-21T09:46:21.2054836Z %17 = arith.muli %16, %cst_13 : tensor<256x1xi32, #blocked2> 2026-02-21T09:46:21.2055078Z %18 = tt.broadcast %17 : tensor<256x1xi32, #blocked2> -> tensor<256x4xi32, #blocked2> 2026-02-21T09:46:21.2055350Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<256x4x!tt.ptr, #blocked2> 2026-02-21T09:46:21.2055555Z %20 = arith.extsi %9 : i32 to i64 2026-02-21T09:46:21.2055728Z %21 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked> 2026-02-21T09:46:21.2055963Z %22 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:21.2056280Z %23 = arith.extsi %22 : tensor<2xi32, 
#ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:21.2056570Z %24 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:46:21.2056825Z %25 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:46:21.2057148Z %26 = arith.extsi %25 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:46:21.2057444Z %27 = arith.addi %24, %26 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:46:21.2057735Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked> 2026-02-21T09:46:21.2058006Z %29 = tt.broadcast %28 : tensor<1x64xi64, #blocked> -> tensor<2x64xi64, #blocked> 2026-02-21T09:46:21.2058225Z %30 = arith.cmpi sge, %28, %cst_11 : tensor<1x64xi64, #blocked> 2026-02-21T09:46:21.2058393Z %31 = arith.cmpi slt, %28, %cst_12 : tensor<1x64xi64, #blocked> 2026-02-21T09:46:21.2058557Z %32 = arith.andi %30, %31 : tensor<1x64xi1, #blocked> 2026-02-21T09:46:21.2058737Z %33 = tt.broadcast %32 : tensor<1x64xi1, #blocked> -> tensor<2x64xi1, #blocked> 2026-02-21T09:46:21.2059023Z %34 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:46:21.2059455Z %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:46:21.2059864Z %36 = tt.expand_dims %35 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:46:21.2060120Z %37 = arith.cmpi eq, %36, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:46:21.2060317Z %38 = tt.broadcast %37 : 
tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, #blocked1> 2026-02-21T09:46:21.2060515Z %39 = arith.cmpi eq, %36, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:46:21.2060709Z %40 = tt.broadcast %39 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, #blocked1> 2026-02-21T09:46:21.2060925Z %41 = ttg.local_alloc : () -> !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> 2026-02-21T09:46:21.2061202Z %42 = tt.expand_dims %15 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:46:21.2061471Z %43 = tt.broadcast %42 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2> 2026-02-21T09:46:21.2061664Z %44 = arith.addi %18, %43 : tensor<256x4xi32, #blocked2> 2026-02-21T09:46:21.2061881Z %45 = tt.addptr %19, %44 : tensor<256x4x!tt.ptr, #blocked2>, tensor<256x4xi32, #blocked2> 2026-02-21T09:46:21.2062086Z %46 = tt.load %45 : tensor<256x4x!tt.ptr, #blocked2> 2026-02-21T09:46:21.2062372Z %47 = ttg.memdesc_index %41[%c0_i32] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:46:21.2062766Z ttg.local_store %46, %47 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:46:21.2063198Z %48:3 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c2_i32 iter_args(%arg4 = %cst_9, %arg5 = %c0_i32, %arg6 = %47) -> (tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>) : i32 { 2026-02-21T09:46:21.2063529Z %103 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:46:21.2063651Z %104 = arith.muli %103, %c2_i32 : i32 2026-02-21T09:46:21.2063824Z %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:46:21.2064049Z %106 = arith.addi %105, %15 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:46:21.2064329Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = 
#blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:46:21.2064611Z %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2> 2026-02-21T09:46:21.2064807Z %109 = arith.addi %18, %108 : tensor<256x4xi32, #blocked2> 2026-02-21T09:46:21.2065012Z %110 = tt.addptr %19, %109 : tensor<256x4x!tt.ptr, #blocked2>, tensor<256x4xi32, #blocked2> 2026-02-21T09:46:21.2065218Z %111 = tt.load %110 : tensor<256x4x!tt.ptr, #blocked2> 2026-02-21T09:46:21.2065521Z %112 = ttg.local_load %arg6 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:21.2065977Z %113 = arith.extf %112 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:21.2066271Z %114 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:46:21.2066440Z %115 = tt.splat %114 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:21.2066659Z %116 = arith.addi %115, %23 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:21.2066933Z %117 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2067177Z %118 = arith.muli %117, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2067359Z %119 = tt.broadcast %118 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked> 2026-02-21T09:46:21.2067548Z %120 = arith.addi %119, %29 : tensor<2x64xi64, #blocked> 2026-02-21T09:46:21.2067740Z %121 = tt.addptr %21, %120 : tensor<2x64x!tt.ptr, #blocked>, tensor<2x64xi64, #blocked> 2026-02-21T09:46:21.2067945Z %122 = arith.cmpi sge, %117, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2068113Z %123 = arith.cmpi slt, %117, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2068274Z %124 = arith.andi %122, %123 : tensor<2x1xi1, #blocked> 
2026-02-21T09:46:21.2068457Z %125 = tt.broadcast %124 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked> 2026-02-21T09:46:21.2068641Z %126 = arith.andi %125, %33 : tensor<2x64xi1, #blocked> 2026-02-21T09:46:21.2068807Z %127 = tt.load %121, %126, %cst_14 : tensor<2x64x!tt.ptr, #blocked> 2026-02-21T09:46:21.2069063Z %128 = ttg.convert_layout %127 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2069345Z %129 = arith.shli %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2069610Z %130 = arith.shrsi %129, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2069868Z %131 = arith.shrsi %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2070157Z %132 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:46:21.2070507Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:46:21.2070793Z %134 = tt.broadcast %132 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2071037Z %135 = arith.select %38, %134, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2071280Z %136 = tt.broadcast %133 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2071515Z %137 = arith.select %40, %136, %135 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2071746Z %138 = tt.reshape %137 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:46:21.2071966Z %139 = arith.sitofp %138 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:46:21.2072256Z %140 = ttg.convert_layout %139 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent 
= #mma, kWidth = 2}>> 2026-02-21T09:46:21.2072729Z %141 = tt.dot %113, %140, %arg4, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x64xf32, #mma> 2026-02-21T09:46:21.2073082Z %142 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:46:21.2073207Z %143 = arith.cmpi slt, %142, %c1_i32 : i32 2026-02-21T09:46:21.2073337Z %144 = arith.select %143, %142, %c0_i32 : i32 2026-02-21T09:46:21.2073621Z %145 = ttg.memdesc_index %41[%144] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:46:21.2073981Z ttg.local_store %111, %145 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:46:21.2074314Z scf.yield %141, %144, %145 : tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> 2026-02-21T09:46:21.2074563Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:46:21.2074874Z %49 = ttg.local_load %48#2 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:21.2075296Z %50 = arith.extf %49 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:21.2075620Z %51 = arith.addi %23, %cst_10 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:46:21.2075896Z %52 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2076133Z %53 = arith.muli %52, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2076317Z %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked> 2026-02-21T09:46:21.2076499Z %55 = arith.addi %54, %29 : 
tensor<2x64xi64, #blocked> 2026-02-21T09:46:21.2076687Z %56 = tt.addptr %21, %55 : tensor<2x64x!tt.ptr, #blocked>, tensor<2x64xi64, #blocked> 2026-02-21T09:46:21.2076884Z %57 = arith.cmpi sge, %52, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2077048Z %58 = arith.cmpi slt, %52, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:46:21.2077205Z %59 = arith.andi %57, %58 : tensor<2x1xi1, #blocked> 2026-02-21T09:46:21.2077376Z %60 = tt.broadcast %59 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked> 2026-02-21T09:46:21.2077557Z %61 = arith.andi %60, %33 : tensor<2x64xi1, #blocked> 2026-02-21T09:46:21.2077735Z %62 = tt.load %56, %61, %cst_14 : tensor<2x64x!tt.ptr, #blocked> 2026-02-21T09:46:21.2077983Z %63 = ttg.convert_layout %62 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2078258Z %64 = arith.shli %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2078502Z %65 = arith.shrsi %64, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2078734Z %66 = arith.shrsi %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:46:21.2079015Z %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:46:21.2079343Z %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1> 2026-02-21T09:46:21.2079621Z %69 = tt.broadcast %67 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2079852Z %70 = arith.select %38, %69, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2080086Z %71 = tt.broadcast %68 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1> 2026-02-21T09:46:21.2080314Z %72 = arith.select %40, %71, %70 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, 
#blocked1> 2026-02-21T09:46:21.2080535Z %73 = tt.reshape %72 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:46:21.2080748Z %74 = arith.sitofp %73 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:46:21.2081033Z %75 = ttg.convert_layout %74 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:46:21.2081504Z %76 = tt.dot %50, %75, %48#0, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x64xf32, #mma> 2026-02-21T09:46:21.2081885Z ttg.local_dealloc %41 : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> 2026-02-21T09:46:21.2082107Z %77 = arith.truncf %76 : tensor<256x64xf32, #mma> to tensor<256x64xbf16, #mma> 2026-02-21T09:46:21.2082270Z %78 = arith.extsi %10 : i32 to i64 2026-02-21T09:46:21.2082423Z %79 = tt.splat %arg2 : !tt.ptr -> tensor<256x64x!tt.ptr, #mma> 2026-02-21T09:46:21.2082671Z %80 = tt.splat %78 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:21.2082944Z %81 = arith.extsi %12 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:21.2083210Z %82 = arith.addi %80, %81 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:46:21.2083466Z %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:46:21.2083700Z %84 = arith.muli %83, %cst_1 : tensor<256x1xi64, #mma> 2026-02-21T09:46:21.2083872Z %85 = tt.broadcast %84 : tensor<256x1xi64, #mma> -> tensor<256x64xi64, #mma> 2026-02-21T09:46:21.2084072Z %86 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:46:21.2084306Z %87 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T09:46:21.2084603Z %88 = arith.extsi %87 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:46:21.2084866Z %89 = arith.addi %86, %88 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:46:21.2085116Z %90 = tt.expand_dims %89 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi64, #mma> 2026-02-21T09:46:21.2085365Z %91 = tt.broadcast %90 : tensor<1x64xi64, #mma> -> tensor<256x64xi64, #mma> 2026-02-21T09:46:21.2085558Z %92 = arith.addi %85, %91 : tensor<256x64xi64, #mma> 2026-02-21T09:46:21.2085739Z %93 = tt.addptr %79, %92 : tensor<256x64x!tt.ptr, #mma>, tensor<256x64xi64, #mma> 2026-02-21T09:46:21.2085935Z %94 = arith.cmpi sge, %83, %cst_0 : tensor<256x1xi64, #mma> 2026-02-21T09:46:21.2086095Z %95 = arith.cmpi slt, %83, %cst : tensor<256x1xi64, #mma> 2026-02-21T09:46:21.2086257Z %96 = arith.andi %94, %95 : tensor<256x1xi1, #mma> 2026-02-21T09:46:21.2086422Z %97 = tt.broadcast %96 : tensor<256x1xi1, #mma> -> tensor<256x64xi1, #mma> 2026-02-21T09:46:21.2086601Z %98 = arith.cmpi sge, %90, %cst_3 : tensor<1x64xi64, #mma> 2026-02-21T09:46:21.2086758Z %99 = arith.cmpi slt, %90, %cst_2 : tensor<1x64xi64, #mma> 2026-02-21T09:46:21.2086908Z %100 = arith.andi %98, %99 : tensor<1x64xi1, #mma> 2026-02-21T09:46:21.2087074Z %101 = tt.broadcast %100 : tensor<1x64xi1, #mma> -> tensor<256x64xi1, #mma> 2026-02-21T09:46:21.2087253Z %102 = arith.andi %97, %101 : tensor<256x64xi1, #mma> 2026-02-21T09:46:21.2087410Z tt.store %93, %77, %102 : tensor<256x64x!tt.ptr, #mma> 2026-02-21T09:46:21.2087548Z tt.return 2026-02-21T09:46:21.2087629Z } 2026-02-21T09:46:21.2087704Z } 2026-02-21T09:46:21.2087746Z 2026-02-21T09:46:21.2087780Z {-# 2026-02-21T09:46:21.2087860Z external_resources: { 2026-02-21T09:46:21.2087963Z mlir_reproducer: { 2026-02-21T09:46:21.2088950Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, 
convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:46:21.2089982Z disable_threading: false, 2026-02-21T09:46:21.2090089Z verify_each: true 2026-02-21T09:46:21.2090178Z } 2026-02-21T09:46:21.2090270Z } 2026-02-21T09:46:21.2090341Z #-} 2026-02-21T09:46:21.2090620Z /tmp/torchinductor_root/hi/chidcxn4ljnvy3bnthwz6udzxevspp27es5to4zt73hthpghpum5.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:46:21.2091305Z /tmp/torchinductor_root/hi/chidcxn4ljnvy3bnthwz6udzxevspp27es5to4zt73hthpghpum5.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:46:21.2091851Z [112s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:46:21.2092575Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 256, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:46:21.2093237Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:46:21.2093409Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:47:02.0608467Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:47:02.0611767Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:47:02.0612130Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:47:02.0612735Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:47:02.0613031Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:47:02.0613328Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:47:02.0613661Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:02.0613840Z #smem = #ttg.shared_memory 2026-02-21T09:47:02.0614073Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:47:02.0614540Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:47:02.0614943Z %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma> 2026-02-21T09:47:02.0615123Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:02.0615298Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:02.0615478Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma> 2026-02-21T09:47:02.0615664Z %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #blocked1> 
2026-02-21T09:47:02.0615845Z %cst_4 = arith.constant dense<8192> : tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0616023Z %cst_5 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2> 2026-02-21T09:47:02.0616175Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:47:02.0616296Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:47:02.0616413Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:47:02.0616529Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:47:02.0616649Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:47:02.0616828Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:47:02.0616970Z %cst_6 = arith.constant dense<0> : tensor<1x256xi8, #blocked1> 2026-02-21T09:47:02.0617113Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:47:02.0617311Z %c512_i64 = arith.constant 512 : i64 2026-02-21T09:47:02.0617429Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T09:47:02.0617576Z %cst_7 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0617718Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:47:02.0617829Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:47:02.0617940Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:47:02.0618053Z %c1216_i32 = arith.constant 1216 : i32 2026-02-21T09:47:02.0618238Z %cst_8 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0618432Z %0 = tt.get_program_id x : i32 2026-02-21T09:47:02.0618628Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:47:02.0618901Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:02.0619173Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:02.0619446Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:02.0619706Z %5 
= tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:47:02.0619971Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x2x!tt.ptr, #blocked2> 2026-02-21T09:47:02.0620172Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x256x!tt.ptr, #blocked1> 2026-02-21T09:47:02.0620454Z %8 = arith.extsi %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:02.0620835Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:47:02.0621246Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:47:02.0621666Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:02.0621919Z %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:02.0622112Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:47:02.0622309Z %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:02.0622495Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked> 2026-02-21T09:47:02.0622707Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:47:02.0622894Z scf.for %arg3 = %0 to %c32768_i32 step %c1216_i32 : i32 { 2026-02-21T09:47:02.0623045Z %17 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T09:47:02.0623174Z %18 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:47:02.0623294Z %19 = arith.muli %17, %c16_i32 : i32 2026-02-21T09:47:02.0623463Z %20 = tt.splat %19 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:47:02.0623675Z %21 = tt.splat %19 : i32 -> 
tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:02.0623883Z %22 = arith.addi %20, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:47:02.0624095Z %23 = arith.addi %21, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:02.0624254Z %24 = arith.muli %18, %c256_i32 : i32 2026-02-21T09:47:02.0624436Z %25 = tt.splat %24 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:02.0624638Z %26 = arith.addi %25, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:02.0624904Z %27 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:47:02.0625172Z %28 = arith.muli %27, %cst_5 : tensor<16x1xi32, #blocked2> 2026-02-21T09:47:02.0625362Z %29 = tt.broadcast %28 : tensor<16x1xi32, #blocked2> -> tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0625534Z %30 = arith.extsi %24 : i32 to i64 2026-02-21T09:47:02.0625697Z %31 = tt.splat %30 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:02.0625917Z %32 = arith.addi %31, %8 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:02.0626193Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0626448Z %34 = arith.cmpi sge, %33, %cst_3 : tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0626625Z %35 = arith.cmpi slt, %33, %cst_4 : tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0626793Z %36 = arith.andi %34, %35 : tensor<1x256xi1, #blocked1> 2026-02-21T09:47:02.0627025Z %37 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x256xf32, #mma>) : i32 { 2026-02-21T09:47:02.0627246Z %46 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:47:02.0627413Z %47 = tt.splat %46 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T09:47:02.0627630Z %48 = arith.addi %47, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:47:02.0627899Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2> 2026-02-21T09:47:02.0628174Z %50 = tt.broadcast %49 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0628413Z %51 = arith.addi %29, %50 : tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0628608Z %52 = tt.addptr %6, %51 : tensor<16x2x!tt.ptr, #blocked2>, tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0628816Z %53 = tt.load %52 : tensor<16x2x!tt.ptr, #blocked2> 2026-02-21T09:47:02.0629100Z %54 = ttg.convert_layout %53 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0629496Z %55 = arith.extf %54 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0629778Z %56 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:47:02.0629900Z %57 = arith.muli %56, %c8192_i64 : i64 2026-02-21T09:47:02.0630039Z %58 = tt.splat %57 : i64 -> tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0630192Z %59 = arith.addi %58, %33 : tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0630394Z %60 = tt.addptr %7, %59 : tensor<1x256x!tt.ptr, #blocked1>, tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0630599Z %61 = tt.load %60, %36, %cst_6 : tensor<1x256x!tt.ptr, #blocked1> 2026-02-21T09:47:02.0630858Z %62 = ttg.convert_layout %61 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0631138Z %63 = arith.shli %62, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0631367Z %64 = arith.shrsi %63, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0631598Z %65 = arith.shrsi %62, %cst_8 : 
tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0631880Z %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0632235Z %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0632519Z %68 = tt.broadcast %66 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0632777Z %69 = arith.select %13, %68, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0633014Z %70 = tt.broadcast %67 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0633241Z %71 = arith.select %15, %70, %69 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0633466Z %72 = tt.reshape %71 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:47:02.0633689Z %73 = arith.sitofp %72 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 2026-02-21T09:47:02.0633935Z %74 = ttg.local_alloc %73 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:47:02.0634262Z %75 = ttg.local_load %74 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0634737Z %76 = tt.dot %55, %75, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:47:02.0635083Z %77 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:47:02.0635206Z %78 = arith.muli %77, %c2_i32 : i32 2026-02-21T09:47:02.0635371Z %79 = tt.splat %78 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:47:02.0635589Z %80 = arith.addi %79, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#blocked2}>> 2026-02-21T09:47:02.0635863Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2> 2026-02-21T09:47:02.0636152Z %82 = tt.broadcast %81 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0636342Z %83 = arith.addi %29, %82 : tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0636536Z %84 = tt.addptr %6, %83 : tensor<16x2x!tt.ptr, #blocked2>, tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0636738Z %85 = tt.load %84 : tensor<16x2x!tt.ptr, #blocked2> 2026-02-21T09:47:02.0637013Z %86 = ttg.convert_layout %85 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0637406Z %87 = arith.extf %86 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0637681Z %88 = arith.extsi %77 : i32 to i64 2026-02-21T09:47:02.0637800Z %89 = arith.muli %88, %c8192_i64 : i64 2026-02-21T09:47:02.0637938Z %90 = tt.splat %89 : i64 -> tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0638093Z %91 = arith.addi %90, %33 : tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0638286Z %92 = tt.addptr %7, %91 : tensor<1x256x!tt.ptr, #blocked1>, tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0638477Z %93 = arith.cmpi slt, %88, %c512_i64 : i64 2026-02-21T09:47:02.0638622Z %94 = tt.splat %93 : i1 -> tensor<1x256xi1, #blocked1> 2026-02-21T09:47:02.0638778Z %95 = arith.andi %94, %36 : tensor<1x256xi1, #blocked1> 2026-02-21T09:47:02.0638943Z %96 = tt.load %92, %95, %cst_6 : tensor<1x256x!tt.ptr, #blocked1> 2026-02-21T09:47:02.0639201Z %97 = ttg.convert_layout %96 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0639486Z %98 = arith.shli %97, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0639719Z %99 = arith.shrsi 
%98, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0639986Z %100 = arith.shrsi %97, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0640277Z %101 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0640658Z %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0640954Z %103 = tt.broadcast %101 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0641202Z %104 = arith.select %13, %103, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0641448Z %105 = tt.broadcast %102 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0641692Z %106 = arith.select %15, %105, %104 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0641929Z %107 = tt.reshape %106 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:47:02.0642160Z %108 = arith.sitofp %107 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 2026-02-21T09:47:02.0642418Z %109 = ttg.local_alloc %108 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:47:02.0642842Z %110 = ttg.local_load %109 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0643315Z %111 = tt.dot %87, %110, %76, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:47:02.0643659Z %112 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:47:02.0643790Z %113 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:47:02.0643971Z %114 = tt.splat %113 : i32 -> 
tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:47:02.0644221Z %115 = arith.addi %114, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:47:02.0644502Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2> 2026-02-21T09:47:02.0644801Z %117 = tt.broadcast %116 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0645002Z %118 = arith.addi %29, %117 : tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0645208Z %119 = tt.addptr %6, %118 : tensor<16x2x!tt.ptr, #blocked2>, tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0645415Z %120 = tt.load %119 : tensor<16x2x!tt.ptr, #blocked2> 2026-02-21T09:47:02.0645692Z %121 = ttg.convert_layout %120 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0646098Z %122 = arith.extf %121 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0646385Z %123 = arith.extsi %112 : i32 to i64 2026-02-21T09:47:02.0646517Z %124 = arith.muli %123, %c8192_i64 : i64 2026-02-21T09:47:02.0646664Z %125 = tt.splat %124 : i64 -> tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0646834Z %126 = arith.addi %125, %33 : tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0647037Z %127 = tt.addptr %7, %126 : tensor<1x256x!tt.ptr, #blocked1>, tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0647239Z %128 = arith.cmpi slt, %123, %c512_i64 : i64 2026-02-21T09:47:02.0647387Z %129 = tt.splat %128 : i1 -> tensor<1x256xi1, #blocked1> 2026-02-21T09:47:02.0647552Z %130 = arith.andi %129, %36 : tensor<1x256xi1, #blocked1> 2026-02-21T09:47:02.0647728Z %131 = tt.load %127, %130, %cst_6 : tensor<1x256x!tt.ptr, #blocked1> 2026-02-21T09:47:02.0648014Z %132 = ttg.convert_layout %131 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0648304Z %133 = arith.shli %132, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0648563Z %134 = arith.shrsi %133, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0648806Z %135 = arith.shrsi %132, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0649103Z %136 = tt.expand_dims %134 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0649442Z %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0649733Z %138 = tt.broadcast %136 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0649980Z %139 = arith.select %13, %138, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0650225Z %140 = tt.broadcast %137 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0650468Z %141 = arith.select %15, %140, %139 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0650701Z %142 = tt.reshape %141 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:47:02.0650933Z %143 = arith.sitofp %142 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 2026-02-21T09:47:02.0651186Z %144 = ttg.local_alloc %143 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:47:02.0651515Z %145 = ttg.local_load %144 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0652007Z %146 = tt.dot %122, %145, %111, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<16x256xf32, #mma> 2026-02-21T09:47:02.0652354Z %147 = arith.addi %arg4, %c3_i32 : i32 2026-02-21T09:47:02.0652487Z %148 = arith.muli %147, %c2_i32 : i32 2026-02-21T09:47:02.0652665Z %149 = tt.splat %148 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:47:02.0652905Z %150 = arith.addi %149, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:47:02.0653188Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2> 2026-02-21T09:47:02.0653465Z %152 = tt.broadcast %151 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0653663Z %153 = arith.addi %29, %152 : tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0653867Z %154 = tt.addptr %6, %153 : tensor<16x2x!tt.ptr, #blocked2>, tensor<16x2xi32, #blocked2> 2026-02-21T09:47:02.0654080Z %155 = tt.load %154 : tensor<16x2x!tt.ptr, #blocked2> 2026-02-21T09:47:02.0654351Z %156 = ttg.convert_layout %155 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0654758Z %157 = arith.extf %156 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0655048Z %158 = arith.extsi %147 : i32 to i64 2026-02-21T09:47:02.0655173Z %159 = arith.muli %158, %c8192_i64 : i64 2026-02-21T09:47:02.0655322Z %160 = tt.splat %159 : i64 -> tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0655488Z %161 = arith.addi %160, %33 : tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0655691Z %162 = tt.addptr %7, %161 : tensor<1x256x!tt.ptr, #blocked1>, tensor<1x256xi64, #blocked1> 2026-02-21T09:47:02.0655909Z %163 = arith.cmpi slt, %158, %c512_i64 : i64 2026-02-21T09:47:02.0656057Z %164 = tt.splat %163 : i1 -> tensor<1x256xi1, #blocked1> 2026-02-21T09:47:02.0656220Z %165 = arith.andi %164, %36 : tensor<1x256xi1, #blocked1> 
2026-02-21T09:47:02.0656413Z %166 = tt.load %162, %165, %cst_6 : tensor<1x256x!tt.ptr, #blocked1> 2026-02-21T09:47:02.0656677Z %167 = ttg.convert_layout %166 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0656968Z %168 = arith.shli %167, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0657208Z %169 = arith.shrsi %168, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0657451Z %170 = arith.shrsi %167, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:02.0657750Z %171 = tt.expand_dims %169 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0658092Z %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked> 2026-02-21T09:47:02.0658389Z %173 = tt.broadcast %171 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0658633Z %174 = arith.select %13, %173, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0658881Z %175 = tt.broadcast %172 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0659127Z %176 = arith.select %15, %175, %174 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked> 2026-02-21T09:47:02.0659368Z %177 = tt.reshape %176 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3> 2026-02-21T09:47:02.0659602Z %178 = arith.sitofp %177 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3> 2026-02-21T09:47:02.0659881Z %179 = ttg.local_alloc %178 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem> 2026-02-21T09:47:02.0660212Z %180 = ttg.local_load %179 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:02.0660684Z 
%181 = tt.dot %157, %180, %146, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma> 2026-02-21T09:47:02.0661046Z scf.yield %181 : tensor<16x256xf32, #mma> 2026-02-21T09:47:02.0661176Z } 2026-02-21T09:47:02.0661307Z %38 = arith.truncf %37 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma> 2026-02-21T09:47:02.0661577Z %39 = tt.expand_dims %23 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:47:02.0661813Z %40 = arith.muli %39, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:47:02.0662047Z %41 = tt.expand_dims %26 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:47:02.0662308Z %42 = tt.broadcast %40 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:47:02.0662509Z %43 = tt.broadcast %41 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma> 2026-02-21T09:47:02.0662690Z %44 = arith.addi %42, %43 : tensor<16x256xi32, #mma> 2026-02-21T09:47:02.0662877Z %45 = tt.addptr %16, %44 : tensor<16x256x!tt.ptr, #mma>, tensor<16x256xi32, #mma> 2026-02-21T09:47:02.0663072Z tt.store %45, %38 : tensor<16x256x!tt.ptr, #mma> 2026-02-21T09:47:02.0663282Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:47:02.0663465Z tt.return 2026-02-21T09:47:02.0663552Z } 2026-02-21T09:47:02.0663632Z } 2026-02-21T09:47:02.0663682Z 2026-02-21T09:47:02.0663715Z {-# 2026-02-21T09:47:02.0663798Z external_resources: { 2026-02-21T09:47:02.0663921Z mlir_reproducer: { 2026-02-21T09:47:02.0664926Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:47:02.0665942Z disable_threading: false, 2026-02-21T09:47:02.0666055Z verify_each: true 2026-02-21T09:47:02.0666151Z } 2026-02-21T09:47:02.0666227Z } 2026-02-21T09:47:02.0666303Z #-} 2026-02-21T09:47:02.0666590Z /tmp/torchinductor_root/zh/czhziawqzz4jltg55krmykgp4tqxs6x5wtip7pfr6hq44mrh2l3m.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:47:02.0667271Z /tmp/torchinductor_root/zh/czhziawqzz4jltg55krmykgp4tqxs6x5wtip7pfr6hq44mrh2l3m.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:47:02.0667829Z [152s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:47:02.0668626Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[2, 0], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:47:02.0669352Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:47:02.0669525Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:47:08.6420627Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:47:08.6434031Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:47:08.6434871Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:47:08.6435526Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:47:08.6436138Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:47:08.6436695Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:08.6437178Z #smem = #ttg.shared_memory 2026-02-21T09:47:08.6437819Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:47:08.6438843Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:47:08.6439681Z %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6440017Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:47:08.6440251Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:47:08.6440481Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:47:08.6440711Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:47:08.6441118Z %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6441418Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:47:08.6441647Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:47:08.6441870Z %c4_i32 = 
arith.constant 4 : i32 2026-02-21T09:47:08.6442228Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T09:47:08.6442473Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:47:08.6442810Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:47:08.6443067Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:47:08.6443299Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:47:08.6443720Z %cst_1 = arith.constant dense<4153344> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6444270Z %cst_2 = arith.constant dense<4128768> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6444657Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:47:08.6444878Z %c7_i32 = arith.constant 7 : i32 2026-02-21T09:47:08.6445237Z %cst_3 = arith.constant dense<10> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6445743Z %cst_4 = arith.constant dense<8> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6446237Z %cst_5 = arith.constant dense<6> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6446720Z %cst_6 = arith.constant dense<4> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6447205Z %cst_7 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6447568Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:47:08.6447797Z %c5_i32 = arith.constant 5 : i32 2026-02-21T09:47:08.6448015Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:47:08.6448303Z %cst_8 = arith.constant dense<1024> : tensor<256x1xi32, #blocked1> 2026-02-21T09:47:08.6448737Z %cst_9 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6449287Z %cst_10 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.6449537Z %cst_11 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.6449783Z %cst_12 = arith.constant 
dense<8192> : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6450024Z %cst_13 = arith.constant dense<0> : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6450292Z %cst_14 = arith.constant dense<16384> : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6450528Z %cst_15 = arith.constant dense<0> : tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6450767Z %cst_16 = arith.constant dense<8192> : tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6450972Z %0 = tt.get_program_id x : i32 2026-02-21T09:47:08.6451133Z %1 = arith.muli %0, %c4_i32 : i32 2026-02-21T09:47:08.6451291Z %2 = arith.addi %1, %c4_i32 : i32 2026-02-21T09:47:08.6451454Z %3 = arith.minsi %2, %c16384_i32 : i32 2026-02-21T09:47:08.6451779Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.6452179Z %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6452623Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.6453062Z %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.6453434Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6453781Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6454122Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6454582Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:47:08.6455175Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = 
#blocked}>> 2026-02-21T09:47:08.6455774Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.6456136Z %14 = arith.cmpi eq, %13, %cst_10 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.6456419Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:47:08.6456694Z %16 = arith.cmpi eq, %13, %cst_11 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.6456962Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked> 2026-02-21T09:47:08.6457260Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:47:08.6457648Z %19 = arith.extsi %5 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6458123Z %20 = arith.extsi %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.6458445Z %21 = arith.subi %3, %1 : i32 2026-02-21T09:47:08.6458621Z %22 = arith.remsi %21, %c2_i32 : i32 2026-02-21T09:47:08.6458785Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:47:08.6458941Z %24 = arith.addi %1, %23 : i32 2026-02-21T09:47:08.6459119Z scf.for %arg3 = %1 to %24 step %c2_i32 : i32 { 2026-02-21T09:47:08.6459283Z %25 = arith.divsi %arg3, %c16384_i32 : i32 2026-02-21T09:47:08.6459425Z %26 = arith.muli %25, %c64_i32 : i32 2026-02-21T09:47:08.6459553Z %27 = arith.subi %c64_i32, %26 : i32 2026-02-21T09:47:08.6459680Z %28 = arith.minsi %27, %c64_i32 : i32 2026-02-21T09:47:08.6459813Z %29 = arith.remsi %arg3, %c16384_i32 : i32 2026-02-21T09:47:08.6459950Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:47:08.6460102Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:47:08.6460226Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:47:08.6460354Z %33 = arith.muli %31, %c256_i32 : i32 2026-02-21T09:47:08.6460539Z %34 = tt.splat %33 : i32 -> tensor<256xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.6460806Z %35 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.6460996Z %36 = arith.muli %32, %c32_i32 : i32 2026-02-21T09:47:08.6461221Z %37 = tt.splat %36 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.6461549Z %38 = arith.addi %37, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.6461899Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:47:08.6462182Z %40 = arith.muli %39, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T09:47:08.6462398Z %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6462791Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6463179Z %43 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6463417Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6463648Z %45 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6463948Z %46 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6464268Z %47 = tt.broadcast %46 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6464481Z %48 = arith.addi %41, %47 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6464717Z %49 = tt.addptr %9, %48 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6464942Z %50 = tt.load %49 : tensor<256x2x!tt.ptr, #blocked1> 
2026-02-21T09:47:08.6465153Z %51 = arith.addi %8, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6465457Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6465759Z %53 = tt.broadcast %52 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6465966Z %54 = arith.addi %41, %53 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6466184Z %55 = tt.addptr %9, %54 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6466407Z %56 = tt.load %55 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6466615Z %57 = arith.addi %8, %cst_6 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6466925Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6467225Z %59 = tt.broadcast %58 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6467433Z %60 = arith.addi %41, %59 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6467644Z %61 = tt.addptr %9, %60 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6467865Z %62 = tt.load %61 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6468178Z %63 = ttg.memdesc_index %43[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6468598Z ttg.local_store %50, %63 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6468996Z %64 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6469394Z ttg.local_store %56, %64 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, 
mutable, 2x256x2> 2026-02-21T09:47:08.6469800Z %65 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6470154Z ttg.local_store %62, %65 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6470421Z %66 = arith.addi %8, %cst_5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6470699Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6470967Z %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6471155Z %69 = arith.addi %41, %68 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6471348Z %70 = tt.addptr %9, %69 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6471545Z %71 = tt.load %70 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6471729Z %72 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6471997Z %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6472264Z %74 = tt.broadcast %73 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6472449Z %75 = arith.addi %41, %74 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6472655Z %76 = tt.addptr %9, %75 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6472853Z %77 = tt.load %76 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6473067Z %78 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6473340Z %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 
2026-02-21T09:47:08.6473606Z %80 = tt.broadcast %79 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6473787Z %81 = arith.addi %41, %80 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6473979Z %82 = tt.addptr %9, %81 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6474174Z %83 = tt.load %82 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6474455Z %84 = ttg.memdesc_index %43[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6474807Z ttg.local_store %71, %84 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6475162Z %85 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6475515Z ttg.local_store %77, %85 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6475863Z %86 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6476213Z ttg.local_store %83, %86 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6477208Z %87:12 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c3_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %63, %arg8 = %84, %arg9 = %64, %arg10 = %85, %arg11 = %c1_i32, %arg12 = %c4_i32, %arg13 = %65, %arg14 = %86, %arg15 = %c2_i32, %arg16 = %c5_i32) -> (tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, 
mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32) : i32 { 2026-02-21T09:47:08.6478104Z %415 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:47:08.6478229Z %416 = arith.muli %415, %c2_i32 : i32 2026-02-21T09:47:08.6478401Z %417 = tt.splat %416 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6478627Z %418 = arith.addi %417, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6478908Z %419 = tt.expand_dims %418 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6479186Z %420 = tt.broadcast %419 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6479380Z %421 = arith.addi %41, %420 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6479584Z %422 = tt.addptr %9, %421 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6479793Z %423 = tt.load %422 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6480097Z %424 = ttg.local_load %arg7 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6480539Z %425 = arith.extf %424 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6480842Z %426 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:47:08.6481023Z %427 = tt.splat %426 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6481268Z %428 = arith.addi %427, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6481581Z %429 = tt.addptr %10, %428 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6481891Z %430 = tt.load %429 : tensor<1x32x!tt.ptr, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6482122Z %431 = arith.shli %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6482357Z %432 = arith.shrsi %431, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6482661Z %433 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6482951Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6483291Z %435 = tt.expand_dims %433 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6483577Z %436 = tt.broadcast %434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6483814Z %437 = arith.select %15, %436, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6484053Z %438 = tt.broadcast %435 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6484283Z %439 = arith.select %17, %438, %437 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6484515Z %440 = tt.reshape %439 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6484760Z %441 = arith.sitofp %440 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6485050Z %442 = ttg.convert_layout %441 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6485524Z %443 = tt.dot %425, %442, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6485891Z %444 = arith.addi %arg4, %c7_i32 : i32 2026-02-21T09:47:08.6486043Z %445 = arith.muli %444, %c2_i32 : i32 
2026-02-21T09:47:08.6486242Z %446 = tt.splat %445 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6486464Z %447 = arith.addi %446, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6486743Z %448 = tt.expand_dims %447 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6487023Z %449 = tt.broadcast %448 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6487218Z %450 = arith.addi %41, %449 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6487421Z %451 = tt.addptr %9, %450 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6487625Z %452 = tt.load %451 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6487930Z %453 = ttg.local_load %arg9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6488364Z %454 = arith.extf %453 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6488651Z %455 = arith.muli %arg11, %c8192_i32 : i32 2026-02-21T09:47:08.6488851Z %456 = tt.splat %455 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6489075Z %457 = arith.addi %456, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6489403Z %458 = tt.addptr %10, %457 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6489715Z %459 = tt.load %458 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6489946Z %460 = arith.shli %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6490182Z %461 = arith.shrsi %460, %cst_9 : tensor<1x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6490416Z %462 = arith.shrsi %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6490707Z %463 = tt.expand_dims %461 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6491042Z %464 = tt.expand_dims %462 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6491322Z %465 = tt.broadcast %463 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6491591Z %466 = arith.select %15, %465, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6491823Z %467 = tt.broadcast %464 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6492054Z %468 = arith.select %17, %467, %466 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6492286Z %469 = tt.reshape %468 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6492509Z %470 = arith.sitofp %469 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6492827Z %471 = ttg.convert_layout %470 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6493287Z %472 = tt.dot %454, %471, %443, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6493675Z %473 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:47:08.6493802Z %474 = arith.muli %473, %c2_i32 : i32 2026-02-21T09:47:08.6493972Z %475 = tt.splat %474 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6494197Z %476 = arith.addi %475, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:47:08.6494474Z %477 = tt.expand_dims %476 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6494754Z %478 = tt.broadcast %477 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6494951Z %479 = arith.addi %41, %478 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6495156Z %480 = tt.addptr %9, %479 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6495364Z %481 = tt.load %480 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6495669Z %482 = ttg.local_load %arg13 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6496109Z %483 = arith.extf %482 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6496397Z %484 = arith.muli %arg15, %c8192_i32 : i32 2026-02-21T09:47:08.6496579Z %485 = tt.splat %484 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6496827Z %486 = arith.addi %485, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6497134Z %487 = tt.addptr %10, %486 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6497461Z %488 = tt.load %487 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6497695Z %489 = arith.shli %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6497927Z %490 = arith.shrsi %489, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6498164Z %491 = arith.shrsi %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6505913Z %492 = tt.expand_dims %490 {axis = 1 : 
i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6506331Z %493 = tt.expand_dims %491 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6506669Z %494 = tt.broadcast %492 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6506906Z %495 = arith.select %15, %494, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6507149Z %496 = tt.broadcast %493 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6507382Z %497 = arith.select %17, %496, %495 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6507611Z %498 = tt.reshape %497 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6507836Z %499 = arith.sitofp %498 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6508133Z %500 = ttg.convert_layout %499 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6508657Z %501 = tt.dot %483, %500, %472, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6509026Z %502 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:47:08.6509154Z %503 = arith.cmpi slt, %502, %c2_i32 : i32 2026-02-21T09:47:08.6509289Z %504 = arith.select %503, %502, %c0_i32 : i32 2026-02-21T09:47:08.6509557Z %505 = ttg.memdesc_index %43[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6509919Z ttg.local_store %423, %505 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6510284Z %506 = ttg.memdesc_index %44[%504] : 
!ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6510636Z ttg.local_store %452, %506 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6510995Z %507 = ttg.memdesc_index %45[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6511351Z ttg.local_store %481, %507 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6512157Z scf.yield %501, %504, %arg8, %505, %arg10, %506, %arg12, %444, %arg14, %507, %arg16, %473 : tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32 2026-02-21T09:47:08.6512833Z } 2026-02-21T09:47:08.6513075Z %88 = ttg.local_load %87#2 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6513519Z %89 = arith.extf %88 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6513855Z %90 = arith.addi %42, %cst_2 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6514164Z %91 = tt.addptr %10, %90 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6514462Z %92 = tt.load %91 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6514689Z %93 = 
arith.shli %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6514918Z %94 = arith.shrsi %93, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6515152Z %95 = arith.shrsi %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6515437Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6515762Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6516035Z %98 = tt.broadcast %96 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6516266Z %99 = arith.select %15, %98, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6516500Z %100 = tt.broadcast %97 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6521345Z %101 = arith.select %17, %100, %99 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6521588Z %102 = tt.reshape %101 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6521811Z %103 = arith.sitofp %102 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6522131Z %104 = ttg.convert_layout %103 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6522655Z %105 = tt.dot %89, %104, %87#0, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6523160Z %106 = ttg.local_load %87#4 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6523602Z %107 = 
arith.extf %106 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6523894Z %108 = arith.muli %87#6, %c8192_i32 : i32 2026-02-21T09:47:08.6524076Z %109 = tt.splat %108 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6524304Z %110 = arith.addi %109, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6524612Z %111 = tt.addptr %10, %110 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6524918Z %112 = tt.load %111 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6525151Z %113 = arith.shli %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6525409Z %114 = arith.shrsi %113, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6525650Z %115 = arith.shrsi %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6525961Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6526295Z %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6526579Z %118 = tt.broadcast %116 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6526818Z %119 = arith.select %15, %118, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6527054Z %120 = tt.broadcast %117 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6527290Z %121 = arith.select %17, %120, %119 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6527520Z %122 = tt.reshape %121 : 
tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6527738Z %123 = arith.sitofp %122 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6528032Z %124 = ttg.convert_layout %123 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6528491Z %125 = tt.dot %107, %124, %105, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6528986Z %126 = ttg.local_load %87#8 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6529524Z %127 = arith.extf %126 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6529833Z %128 = arith.muli %87#10, %c8192_i32 : i32 2026-02-21T09:47:08.6530013Z %129 = tt.splat %128 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6530240Z %130 = arith.addi %129, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6530567Z %131 = tt.addptr %10, %130 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6530878Z %132 = tt.load %131 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6531107Z %133 = arith.shli %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6531344Z %134 = arith.shrsi %133, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6531580Z %135 = arith.shrsi %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6531865Z %136 = tt.expand_dims 
%134 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6532204Z %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6532485Z %138 = tt.broadcast %136 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6532722Z %139 = arith.select %15, %138, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6532958Z %140 = tt.broadcast %137 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6533191Z %141 = arith.select %17, %140, %139 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6533419Z %142 = tt.reshape %141 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6533656Z %143 = arith.sitofp %142 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6533949Z %144 = ttg.convert_layout %143 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6534422Z %145 = tt.dot %127, %144, %125, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6534916Z %146 = ttg.local_load %87#3 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6535351Z %147 = arith.extf %146 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6535686Z %148 = arith.addi %42, %cst_1 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6536002Z %149 = tt.addptr %10, %148 : tensor<1x32x!tt.ptr, 
#ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6536315Z %150 = tt.load %149 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6536544Z %151 = arith.shli %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6536805Z %152 = arith.shrsi %151, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6537042Z %153 = arith.shrsi %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6537330Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6537666Z %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6537963Z %156 = tt.broadcast %154 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6538199Z %157 = arith.select %15, %156, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6538435Z %158 = tt.broadcast %155 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6538682Z %159 = arith.select %17, %158, %157 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6538910Z %160 = tt.reshape %159 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6539131Z %161 = arith.sitofp %160 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6539423Z %162 = ttg.convert_layout %161 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6539885Z %163 = tt.dot %147, %162, %145, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 
2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6540382Z %164 = ttg.local_load %87#5 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6540813Z %165 = arith.extf %164 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6541098Z %166 = arith.muli %87#7, %c8192_i32 : i32 2026-02-21T09:47:08.6541276Z %167 = tt.splat %166 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6541504Z %168 = arith.addi %167, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6541832Z %169 = tt.addptr %10, %168 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6542141Z %170 = tt.load %169 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6542392Z %171 = arith.shli %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6542625Z %172 = arith.shrsi %171, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6542862Z %173 = arith.shrsi %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6543145Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6543481Z %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6543763Z %176 = tt.broadcast %174 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6543999Z %177 = arith.select %15, %176, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6544235Z %178 = 
tt.broadcast %175 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6544464Z %179 = arith.select %17, %178, %177 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6544695Z %180 = tt.reshape %179 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6544916Z %181 = arith.sitofp %180 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6545209Z %182 = ttg.convert_layout %181 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6545674Z %183 = tt.dot %165, %182, %163, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6546186Z %184 = ttg.local_load %87#9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6546616Z %185 = arith.extf %184 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6546918Z %186 = arith.muli %87#11, %c8192_i32 : i32 2026-02-21T09:47:08.6547095Z %187 = tt.splat %186 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6547322Z %188 = arith.addi %187, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6547633Z %189 = tt.addptr %10, %188 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6547940Z %190 = tt.load %189 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6548170Z %191 = arith.shli %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6548407Z %192 = 
arith.shrsi %191, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6548644Z %193 = arith.shrsi %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6548932Z %194 = tt.expand_dims %192 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6549262Z %195 = tt.expand_dims %193 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6549544Z %196 = tt.broadcast %194 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6549796Z %197 = arith.select %15, %196, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6550029Z %198 = tt.broadcast %195 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6550263Z %199 = arith.select %17, %198, %197 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6550506Z %200 = tt.reshape %199 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6550728Z %201 = arith.sitofp %200 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6551021Z %202 = ttg.convert_layout %201 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6551478Z %203 = tt.dot %185, %202, %183, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6551871Z ttg.local_dealloc %45 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6552081Z ttg.local_dealloc %44 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6552287Z ttg.local_dealloc %43 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 
2026-02-21T09:47:08.6552551Z %204 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %203) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:47:08.6552776Z %415 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:47:08.6552953Z %416 = tt.splat %415 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6553179Z %417 = arith.addi %416, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6553460Z %418 = tt.expand_dims %417 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6553741Z %419 = tt.broadcast %418 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6553959Z %420 = arith.addi %41, %419 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6554165Z %421 = tt.addptr %9, %420 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6554374Z %422 = tt.load %421 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6554662Z %423 = ttg.convert_layout %422 : tensor<256x2xbf16, #blocked1> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6555069Z %424 = arith.extf %423 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6555359Z %425 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:47:08.6555539Z %426 = tt.splat %425 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6555766Z %427 = arith.addi %426, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6556079Z %428 = tt.addptr %10, %427 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6556391Z %429 = tt.load %428 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:47:08.6556624Z %430 = arith.shli %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6556861Z %431 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6557097Z %432 = arith.shrsi %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6557389Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6557742Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6558025Z %435 = tt.broadcast %433 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6558285Z %436 = arith.select %15, %435, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6558524Z %437 = tt.broadcast %434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6558761Z %438 = arith.select %17, %437, %436 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6558991Z %439 = tt.reshape %438 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6559212Z %440 = arith.sitofp %439 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6559506Z %441 = ttg.convert_layout %440 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6559976Z %442 = tt.dot %424, %441, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6560329Z scf.yield %442 : tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6560458Z } {tt.num_stages = 1 : i32} 2026-02-21T09:47:08.6560619Z %205 = arith.truncf %204 : 
tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:47:08.6560821Z %206 = arith.extsi %33 : i32 to i64 2026-02-21T09:47:08.6560939Z %207 = arith.extsi %36 : i32 to i64 2026-02-21T09:47:08.6561107Z %208 = tt.splat %206 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6561338Z %209 = arith.addi %208, %19 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6561606Z %210 = tt.expand_dims %209 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6561870Z %211 = arith.muli %210, %cst_12 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6562056Z %212 = tt.broadcast %211 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6562265Z %213 = tt.splat %207 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.6562487Z %214 = arith.addi %213, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.6562799Z %215 = tt.expand_dims %214 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6563058Z %216 = tt.broadcast %215 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6563240Z %217 = arith.addi %212, %216 : tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6563431Z %218 = tt.addptr %18, %217 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6563631Z %219 = arith.cmpi sge, %210, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6563800Z %220 = arith.cmpi slt, %210, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6563958Z %221 = arith.andi %219, %220 : tensor<256x1xi1, #mma> 2026-02-21T09:47:08.6564134Z %222 = tt.broadcast %221 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6564320Z %223 = arith.cmpi sge, %215, %cst_15 : tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6564485Z %224 = arith.cmpi slt, %215, %cst_16 : tensor<1x32xi64, #mma> 
2026-02-21T09:47:08.6564642Z %225 = arith.andi %223, %224 : tensor<1x32xi1, #mma> 2026-02-21T09:47:08.6564811Z %226 = tt.broadcast %225 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6564988Z %227 = arith.andi %222, %226 : tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6565146Z tt.store %218, %205, %227 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:47:08.6565296Z %228 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:47:08.6565421Z %229 = arith.divsi %228, %c16384_i32 : i32 2026-02-21T09:47:08.6565567Z %230 = arith.muli %229, %c64_i32 : i32 2026-02-21T09:47:08.6565684Z %231 = arith.subi %c64_i32, %230 : i32 2026-02-21T09:47:08.6565820Z %232 = arith.minsi %231, %c64_i32 : i32 2026-02-21T09:47:08.6565943Z %233 = arith.remsi %228, %c16384_i32 : i32 2026-02-21T09:47:08.6566062Z %234 = arith.remsi %233, %232 : i32 2026-02-21T09:47:08.6566178Z %235 = arith.addi %230, %234 : i32 2026-02-21T09:47:08.6566290Z %236 = arith.divsi %233, %232 : i32 2026-02-21T09:47:08.6566407Z %237 = arith.muli %235, %c256_i32 : i32 2026-02-21T09:47:08.6566579Z %238 = tt.splat %237 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.6566802Z %239 = arith.addi %238, %4 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.6566976Z %240 = arith.muli %236, %c32_i32 : i32 2026-02-21T09:47:08.6567180Z %241 = tt.splat %240 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.6567483Z %242 = arith.addi %241, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.6567806Z %243 = tt.expand_dims %239 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:47:08.6568061Z %244 = arith.muli %243, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T09:47:08.6568258Z %245 = tt.broadcast %244 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, 
#blocked1> 2026-02-21T09:47:08.6568638Z %246 = tt.expand_dims %242 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6568988Z %247 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6569202Z %248 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6569428Z %249 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6569615Z %250 = arith.addi %245, %47 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6569815Z %251 = tt.addptr %9, %250 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6570049Z %252 = tt.load %251 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6570211Z %253 = arith.addi %245, %53 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6570405Z %254 = tt.addptr %9, %253 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6570610Z %255 = tt.load %254 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6570766Z %256 = arith.addi %245, %59 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6570962Z %257 = tt.addptr %9, %256 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6571164Z %258 = tt.load %257 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6571450Z %259 = ttg.memdesc_index %247[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6571817Z ttg.local_store %252, %259 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6572181Z %260 = ttg.memdesc_index %248[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6572567Z 
ttg.local_store %255, %260 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6572928Z %261 = ttg.memdesc_index %249[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6573314Z ttg.local_store %258, %261 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6573553Z %262 = arith.addi %245, %68 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6573788Z %263 = tt.addptr %9, %262 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6573992Z %264 = tt.load %263 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6574151Z %265 = arith.addi %245, %74 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6574344Z %266 = tt.addptr %9, %265 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6574545Z %267 = tt.load %266 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6574700Z %268 = arith.addi %245, %80 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6574895Z %269 = tt.addptr %9, %268 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6575096Z %270 = tt.load %269 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6575373Z %271 = ttg.memdesc_index %247[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6575733Z ttg.local_store %264, %271 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6576090Z %272 = ttg.memdesc_index %248[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6576446Z ttg.local_store %267, %272 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 
2026-02-21T09:47:08.6576803Z %273 = ttg.memdesc_index %249[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6577176Z ttg.local_store %270, %273 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6578169Z %274:12 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c3_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %259, %arg8 = %271, %arg9 = %260, %arg10 = %272, %arg11 = %c1_i32, %arg12 = %c4_i32, %arg13 = %261, %arg14 = %273, %arg15 = %c2_i32, %arg16 = %c5_i32) -> (tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32) : i32 { 2026-02-21T09:47:08.6579071Z %415 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:47:08.6579195Z %416 = arith.muli %415, %c2_i32 : i32 2026-02-21T09:47:08.6579368Z %417 = tt.splat %416 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6579592Z %418 = arith.addi %417, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6579870Z %419 = tt.expand_dims %418 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6580171Z %420 = tt.broadcast %419 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6580365Z %421 = arith.addi %245, %420 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6580567Z %422 = tt.addptr %9, %421 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6580774Z %423 = tt.load %422 : tensor<256x2x!tt.ptr, #blocked1> 
2026-02-21T09:47:08.6581077Z %424 = ttg.local_load %arg7 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6581538Z %425 = arith.extf %424 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6581839Z %426 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:47:08.6582020Z %427 = tt.splat %426 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6582252Z %428 = arith.addi %427, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6582561Z %429 = tt.addptr %10, %428 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6582866Z %430 = tt.load %429 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6583096Z %431 = arith.shli %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6583403Z %432 = arith.shrsi %431, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6583642Z %433 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6583936Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6584272Z %435 = tt.expand_dims %433 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6584557Z %436 = tt.broadcast %434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6584797Z %437 = arith.select %15, %436, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6585038Z %438 = tt.broadcast %435 : 
tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6585269Z %439 = arith.select %17, %438, %437 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6585516Z %440 = tt.reshape %439 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6585736Z %441 = arith.sitofp %440 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6586029Z %442 = ttg.convert_layout %441 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6586509Z %443 = tt.dot %425, %442, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6586857Z %444 = arith.addi %arg4, %c7_i32 : i32 2026-02-21T09:47:08.6586979Z %445 = arith.muli %444, %c2_i32 : i32 2026-02-21T09:47:08.6587149Z %446 = tt.splat %445 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6587375Z %447 = arith.addi %446, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6587651Z %448 = tt.expand_dims %447 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6587933Z %449 = tt.broadcast %448 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6588130Z %450 = arith.addi %245, %449 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6588331Z %451 = tt.addptr %9, %450 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6588535Z %452 = tt.load %451 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6588837Z %453 = ttg.local_load %arg9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6589285Z %454 = 
arith.extf %453 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6589569Z %455 = arith.muli %arg11, %c8192_i32 : i32 2026-02-21T09:47:08.6589749Z %456 = tt.splat %455 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6589996Z %457 = arith.addi %456, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6590309Z %458 = tt.addptr %10, %457 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6590616Z %459 = tt.load %458 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6590845Z %460 = arith.shli %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6591080Z %461 = arith.shrsi %460, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6591317Z %462 = arith.shrsi %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6591602Z %463 = tt.expand_dims %461 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6591935Z %464 = tt.expand_dims %462 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6592214Z %465 = tt.broadcast %463 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6592451Z %466 = arith.select %15, %465, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6592685Z %467 = tt.broadcast %464 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6592914Z %468 = arith.select %17, %467, %466 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6593142Z %469 = tt.reshape %468 : 
tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6593375Z %470 = arith.sitofp %469 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6593670Z %471 = ttg.convert_layout %470 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6594142Z %472 = tt.dot %454, %471, %443, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6594484Z %473 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:47:08.6594606Z %474 = arith.muli %473, %c2_i32 : i32 2026-02-21T09:47:08.6594774Z %475 = tt.splat %474 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6594995Z %476 = arith.addi %475, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6595268Z %477 = tt.expand_dims %476 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6595542Z %478 = tt.broadcast %477 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6595739Z %479 = arith.addi %245, %478 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6595940Z %480 = tt.addptr %9, %479 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6596146Z %481 = tt.load %480 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6596452Z %482 = ttg.local_load %arg13 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6596889Z %483 = arith.extf %482 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6597196Z %484 = arith.muli %arg15, %c8192_i32 : i32 
2026-02-21T09:47:08.6597375Z %485 = tt.splat %484 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6597620Z %486 = arith.addi %485, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6597927Z %487 = tt.addptr %10, %486 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6598238Z %488 = tt.load %487 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6598470Z %489 = arith.shli %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6598704Z %490 = arith.shrsi %489, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6598940Z %491 = arith.shrsi %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6599230Z %492 = tt.expand_dims %490 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6599559Z %493 = tt.expand_dims %491 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6599841Z %494 = tt.broadcast %492 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6600077Z %495 = arith.select %15, %494, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6600308Z %496 = tt.broadcast %493 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6600540Z %497 = arith.select %17, %496, %495 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6600767Z %498 = tt.reshape %497 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6600988Z %499 = arith.sitofp %498 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6601305Z %500 = ttg.convert_layout %499 : 
tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6601765Z %501 = tt.dot %483, %500, %472, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6602121Z %502 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:47:08.6602246Z %503 = arith.cmpi slt, %502, %c2_i32 : i32 2026-02-21T09:47:08.6602378Z %504 = arith.select %503, %502, %c0_i32 : i32 2026-02-21T09:47:08.6602704Z %505 = ttg.memdesc_index %247[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6603068Z ttg.local_store %423, %505 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6603427Z %506 = ttg.memdesc_index %248[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6603787Z ttg.local_store %452, %506 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6604142Z %507 = ttg.memdesc_index %249[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6604497Z ttg.local_store %481, %507 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6605303Z scf.yield %501, %504, %arg8, %505, %arg10, %506, %arg12, %444, %arg14, %507, %arg16, %473 : tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, 
#shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32 2026-02-21T09:47:08.6605993Z } 2026-02-21T09:47:08.6606234Z %275 = ttg.local_load %274#2 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6606664Z %276 = arith.extf %275 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6606996Z %277 = arith.addi %246, %cst_2 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6607305Z %278 = tt.addptr %10, %277 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6607610Z %279 = tt.load %278 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6607841Z %280 = arith.shli %279, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6608074Z %281 = arith.shrsi %280, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6608307Z %282 = arith.shrsi %279, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6608591Z %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6608920Z %284 = tt.expand_dims %282 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6609201Z %285 = tt.broadcast %283 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6609453Z %286 = arith.select %15, %285, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6609687Z %287 = tt.broadcast %284 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 
2026-02-21T09:47:08.6609916Z %288 = arith.select %17, %287, %286 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6610158Z %289 = tt.reshape %288 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6610379Z %290 = arith.sitofp %289 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6610669Z %291 = ttg.convert_layout %290 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6611127Z %292 = tt.dot %276, %291, %274#0, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6611627Z %293 = ttg.local_load %274#4 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6612055Z %294 = arith.extf %293 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6612339Z %295 = arith.muli %274#6, %c8192_i32 : i32 2026-02-21T09:47:08.6612514Z %296 = tt.splat %295 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6612740Z %297 = arith.addi %296, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6613049Z %298 = tt.addptr %10, %297 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6613352Z %299 = tt.load %298 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6613597Z %300 = arith.shli %299, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6613832Z %301 = arith.shrsi %300, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:47:08.6614083Z %302 = arith.shrsi %299, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6614372Z %303 = tt.expand_dims %301 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6614705Z %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6614988Z %305 = tt.broadcast %303 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6615229Z %306 = arith.select %15, %305, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6615465Z %307 = tt.broadcast %304 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6615696Z %308 = arith.select %17, %307, %306 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6615921Z %309 = tt.reshape %308 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6616141Z %310 = arith.sitofp %309 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6616432Z %311 = ttg.convert_layout %310 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6616886Z %312 = tt.dot %294, %311, %292, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6617385Z %313 = ttg.local_load %274#8 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6617833Z %314 = arith.extf %313 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:47:08.6618116Z %315 = arith.muli %274#10, %c8192_i32 : i32 2026-02-21T09:47:08.6618308Z %316 = tt.splat %315 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6618537Z %317 = arith.addi %316, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6618847Z %318 = tt.addptr %10, %317 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6619153Z %319 = tt.load %318 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6619379Z %320 = arith.shli %319, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6619616Z %321 = arith.shrsi %320, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6619847Z %322 = arith.shrsi %319, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6620132Z %323 = tt.expand_dims %321 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6620467Z %324 = tt.expand_dims %322 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6620744Z %325 = tt.broadcast %323 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6620979Z %326 = arith.select %15, %325, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6621210Z %327 = tt.broadcast %324 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6621457Z %328 = arith.select %17, %327, %326 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6621684Z %329 = tt.reshape %328 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6621925Z %330 = arith.sitofp %329 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, 
#blocked2> 2026-02-21T09:47:08.6622215Z %331 = ttg.convert_layout %330 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6622670Z %332 = tt.dot %314, %331, %312, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6623168Z %333 = ttg.local_load %274#3 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6623604Z %334 = arith.extf %333 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6623939Z %335 = arith.addi %246, %cst_1 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6624254Z %336 = tt.addptr %10, %335 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6624564Z %337 = tt.load %336 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6624791Z %338 = arith.shli %337, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6625026Z %339 = arith.shrsi %338, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6625259Z %340 = arith.shrsi %337, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6625566Z %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6625900Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6626182Z %343 = tt.broadcast %341 : 
tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6626438Z %344 = arith.select %15, %343, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6626679Z %345 = tt.broadcast %342 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6626911Z %346 = arith.select %17, %345, %344 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6627144Z %347 = tt.reshape %346 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6627364Z %348 = arith.sitofp %347 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6627662Z %349 = ttg.convert_layout %348 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6628125Z %350 = tt.dot %334, %349, %332, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6628618Z %351 = ttg.local_load %274#5 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6629056Z %352 = arith.extf %351 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6629346Z %353 = arith.muli %274#7, %c8192_i32 : i32 2026-02-21T09:47:08.6629525Z %354 = tt.splat %353 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6629776Z %355 = arith.addi %354, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6630087Z %356 = tt.addptr %10, %355 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6630424Z %357 = tt.load 
%356 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6630661Z %358 = arith.shli %357, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6630898Z %359 = arith.shrsi %358, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6631141Z %360 = arith.shrsi %357, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6631434Z %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6631777Z %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6632065Z %363 = tt.broadcast %361 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6632309Z %364 = arith.select %15, %363, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6632552Z %365 = tt.broadcast %362 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6632785Z %366 = arith.select %17, %365, %364 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6633020Z %367 = tt.reshape %366 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6633249Z %368 = arith.sitofp %367 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6633543Z %369 = ttg.convert_layout %368 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6634026Z %370 = tt.dot %352, %369, %350, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6634521Z %371 = ttg.local_load %274#9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 
2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6634976Z %372 = arith.extf %371 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6635272Z %373 = arith.muli %274#11, %c8192_i32 : i32 2026-02-21T09:47:08.6635453Z %374 = tt.splat %373 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6635692Z %375 = arith.addi %374, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6636010Z %376 = tt.addptr %10, %375 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6636322Z %377 = tt.load %376 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6636560Z %378 = arith.shli %377, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6636801Z %379 = arith.shrsi %378, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6637043Z %380 = arith.shrsi %377, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6637339Z %381 = tt.expand_dims %379 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6637677Z %382 = tt.expand_dims %380 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6637978Z %383 = tt.broadcast %381 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6638218Z %384 = arith.select %15, %383, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6638476Z %385 = tt.broadcast %382 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6638712Z %386 = arith.select %17, %385, %384 
: tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6638942Z %387 = tt.reshape %386 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6639166Z %388 = arith.sitofp %387 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6639457Z %389 = ttg.convert_layout %388 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6639927Z %390 = tt.dot %372, %389, %370, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6640316Z ttg.local_dealloc %249 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6640533Z ttg.local_dealloc %248 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6640749Z ttg.local_dealloc %247 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6641010Z %391 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %390) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:47:08.6641239Z %415 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:47:08.6641423Z %416 = tt.splat %415 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6641653Z %417 = arith.addi %416, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6641954Z %418 = tt.expand_dims %417 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6642237Z %419 = tt.broadcast %418 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6642446Z %420 = arith.addi %245, %419 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6642746Z %421 = tt.addptr %9, %420 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6642960Z %422 = 
tt.load %421 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6643240Z %423 = ttg.convert_layout %422 : tensor<256x2xbf16, #blocked1> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6643652Z %424 = arith.extf %423 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6643946Z %425 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:47:08.6644138Z %426 = tt.splat %425 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6644372Z %427 = arith.addi %426, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6644695Z %428 = tt.addptr %10, %427 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6645012Z %429 = tt.load %428 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6645247Z %430 = arith.shli %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6645489Z %431 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6645730Z %432 = arith.shrsi %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6646049Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6646389Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6646694Z %435 = tt.broadcast %433 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6646938Z %436 = arith.select %15, %435, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6647177Z %437 = tt.broadcast 
%434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6647416Z %438 = arith.select %17, %437, %436 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6647652Z %439 = tt.reshape %438 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6647879Z %440 = arith.sitofp %439 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6648180Z %441 = ttg.convert_layout %440 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6648649Z %442 = tt.dot %424, %441, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6649008Z scf.yield %442 : tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6649145Z } {tt.num_stages = 1 : i32} 2026-02-21T09:47:08.6649311Z %392 = arith.truncf %391 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:47:08.6649493Z %393 = arith.extsi %237 : i32 to i64 2026-02-21T09:47:08.6649616Z %394 = arith.extsi %240 : i32 to i64 2026-02-21T09:47:08.6649788Z %395 = tt.splat %393 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6650011Z %396 = arith.addi %395, %19 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6650303Z %397 = tt.expand_dims %396 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6650556Z %398 = arith.muli %397, %cst_12 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6650760Z %399 = tt.broadcast %398 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6650975Z %400 = tt.splat %394 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.6651185Z %401 = arith.addi %400, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = 
#mma}>> 2026-02-21T09:47:08.6651453Z %402 = tt.expand_dims %401 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6651718Z %403 = tt.broadcast %402 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6651903Z %404 = arith.addi %399, %403 : tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6652101Z %405 = tt.addptr %18, %404 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6652306Z %406 = arith.cmpi sge, %397, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6652482Z %407 = arith.cmpi slt, %397, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6652648Z %408 = arith.andi %406, %407 : tensor<256x1xi1, #mma> 2026-02-21T09:47:08.6652826Z %409 = tt.broadcast %408 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6653018Z %410 = arith.cmpi sge, %402, %cst_15 : tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6653185Z %411 = arith.cmpi slt, %402, %cst_16 : tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6653349Z %412 = arith.andi %410, %411 : tensor<1x32xi1, #mma> 2026-02-21T09:47:08.6653524Z %413 = tt.broadcast %412 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6653709Z %414 = arith.andi %409, %413 : tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6653892Z tt.store %405, %392, %414 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:47:08.6654041Z } {tt.num_stages = 1 : i32} 2026-02-21T09:47:08.6654169Z scf.for %arg3 = %24 to %3 step %c1_i32 : i32 { 2026-02-21T09:47:08.6654326Z %25 = arith.divsi %arg3, %c16384_i32 : i32 2026-02-21T09:47:08.6654462Z %26 = arith.muli %25, %c64_i32 : i32 2026-02-21T09:47:08.6654585Z %27 = arith.subi %c64_i32, %26 : i32 2026-02-21T09:47:08.6654710Z %28 = arith.minsi %27, %c64_i32 : i32 2026-02-21T09:47:08.6654838Z %29 = arith.remsi %arg3, %c16384_i32 : i32 2026-02-21T09:47:08.6654963Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:47:08.6655083Z %31 = arith.addi %26, %30 : i32 
2026-02-21T09:47:08.6655199Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:47:08.6655319Z %33 = arith.muli %31, %c256_i32 : i32 2026-02-21T09:47:08.6655491Z %34 = tt.splat %33 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.6655725Z %35 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.6655902Z %36 = arith.muli %32, %c32_i32 : i32 2026-02-21T09:47:08.6656113Z %37 = tt.splat %36 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.6656415Z %38 = arith.addi %37, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.6656732Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:47:08.6656989Z %40 = arith.muli %39, %cst_8 : tensor<256x1xi32, #blocked1> 2026-02-21T09:47:08.6657189Z %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6657545Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6657910Z %43 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6658127Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6658338Z %45 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6658618Z %46 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6658888Z %47 = tt.broadcast %46 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6659084Z %48 = arith.addi %41, %47 : tensor<256x2xi32, #blocked1> 
2026-02-21T09:47:08.6659282Z %49 = tt.addptr %9, %48 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6659488Z %50 = tt.load %49 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6659682Z %51 = arith.addi %8, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6659955Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6660229Z %53 = tt.broadcast %52 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6660419Z %54 = arith.addi %41, %53 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6660615Z %55 = tt.addptr %9, %54 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6660816Z %56 = tt.load %55 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6661005Z %57 = arith.addi %8, %cst_6 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6661279Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6661568Z %59 = tt.broadcast %58 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6661762Z %60 = arith.addi %41, %59 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6661973Z %61 = tt.addptr %9, %60 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6662175Z %62 = tt.load %61 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6662459Z %63 = ttg.memdesc_index %43[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6662816Z ttg.local_store %50, %63 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6663173Z %64 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 
-> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6663529Z ttg.local_store %56, %64 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6663881Z %65 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6664236Z ttg.local_store %62, %65 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6664502Z %66 = arith.addi %8, %cst_5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6664774Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6665043Z %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6665228Z %69 = arith.addi %41, %68 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6665425Z %70 = tt.addptr %9, %69 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6665638Z %71 = tt.load %70 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6665825Z %72 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6666099Z %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6666379Z %74 = tt.broadcast %73 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6666566Z %75 = arith.addi %41, %74 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6666756Z %76 = tt.addptr %9, %75 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6666956Z %77 = tt.load %76 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6667143Z %78 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T09:47:08.6667415Z %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6667683Z %80 = tt.broadcast %79 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6667873Z %81 = arith.addi %41, %80 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6668070Z %82 = tt.addptr %9, %81 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6668269Z %83 = tt.load %82 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6668547Z %84 = ttg.memdesc_index %43[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6668902Z ttg.local_store %71, %84 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6669270Z %85 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6669623Z ttg.local_store %77, %85 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6669989Z %86 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6670339Z ttg.local_store %83, %86 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6671320Z %87:12 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c3_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %63, %arg8 = %84, %arg9 = %64, %arg10 = %85, %arg11 = %c1_i32, %arg12 = %c4_i32, %arg13 = %65, %arg14 = %86, %arg15 = %c2_i32, %arg16 = %c5_i32) -> (tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, 
!ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32) : i32 { 2026-02-21T09:47:08.6672195Z %228 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:47:08.6672322Z %229 = arith.muli %228, %c2_i32 : i32 2026-02-21T09:47:08.6672498Z %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6672724Z %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6673001Z %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6673281Z %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6673477Z %234 = arith.addi %41, %233 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6673702Z %235 = tt.addptr %9, %234 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6673910Z %236 = tt.load %235 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6674213Z %237 = ttg.local_load %arg7 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6674665Z %238 = arith.extf %237 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6674949Z %239 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:47:08.6675131Z %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6675363Z %241 = arith.addi %240, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6675674Z %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 
1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6675983Z %243 = tt.load %242 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6676215Z %244 = arith.shli %243, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6676451Z %245 = arith.shrsi %244, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6676689Z %246 = arith.shrsi %243, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6676976Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6677333Z %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6677647Z %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6677885Z %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6678140Z %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6678373Z %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6678602Z %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6678824Z %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6679116Z %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6679587Z %256 = tt.dot %238, %255, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6679934Z %257 = arith.addi %arg4, %c7_i32 : i32 2026-02-21T09:47:08.6680062Z %258 = arith.muli %257, %c2_i32 : i32 2026-02-21T09:47:08.6680236Z %259 = tt.splat %258 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6680458Z %260 = arith.addi %259, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6680733Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6681008Z %262 = tt.broadcast %261 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6681207Z %263 = arith.addi %41, %262 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6681411Z %264 = tt.addptr %9, %263 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6681633Z %265 = tt.load %264 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6681937Z %266 = ttg.local_load %arg9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6682393Z %267 = arith.extf %266 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6682897Z %268 = arith.muli %arg11, %c8192_i32 : i32 2026-02-21T09:47:08.6683089Z %269 = tt.splat %268 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6683321Z %270 = arith.addi %269, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6683639Z %271 = tt.addptr %10, %270 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6683951Z %272 = tt.load %271 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6684186Z %273 = 
arith.shli %272, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6684422Z %274 = arith.shrsi %273, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6684659Z %275 = arith.shrsi %272, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6684948Z %276 = tt.expand_dims %274 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6685280Z %277 = tt.expand_dims %275 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6685563Z %278 = tt.broadcast %276 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6685877Z %279 = arith.select %15, %278, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6686110Z %280 = tt.broadcast %277 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6686369Z %281 = arith.select %17, %280, %279 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6686597Z %282 = tt.reshape %281 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6686819Z %283 = arith.sitofp %282 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6687114Z %284 = ttg.convert_layout %283 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6687581Z %285 = tt.dot %267, %284, %256, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6687930Z %286 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:47:08.6688056Z %287 = arith.muli %286, %c2_i32 : i32 2026-02-21T09:47:08.6688229Z %288 = tt.splat %287 : i32 -> tensor<2xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6688455Z %289 = arith.addi %288, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6688732Z %290 = tt.expand_dims %289 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6689014Z %291 = tt.broadcast %290 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6689214Z %292 = arith.addi %41, %291 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6689415Z %293 = tt.addptr %9, %292 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6689624Z %294 = tt.load %293 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6689963Z %295 = ttg.local_load %arg13 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6690403Z %296 = arith.extf %295 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6690712Z %297 = arith.muli %arg15, %c8192_i32 : i32 2026-02-21T09:47:08.6690890Z %298 = tt.splat %297 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6691116Z %299 = arith.addi %298, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6691424Z %300 = tt.addptr %10, %299 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6691733Z %301 = tt.load %300 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6691968Z %302 = arith.shli %301, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6692202Z %303 = arith.shrsi %302, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6692438Z %304 
= arith.shrsi %301, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6692724Z %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6693060Z %306 = tt.expand_dims %304 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6693342Z %307 = tt.broadcast %305 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6693592Z %308 = arith.select %15, %307, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6693830Z %309 = tt.broadcast %306 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6694061Z %310 = arith.select %17, %309, %308 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6694309Z %311 = tt.reshape %310 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6694532Z %312 = arith.sitofp %311 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6694826Z %313 = ttg.convert_layout %312 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6695291Z %314 = tt.dot %296, %313, %285, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6695638Z %315 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:47:08.6695768Z %316 = arith.cmpi slt, %315, %c2_i32 : i32 2026-02-21T09:47:08.6695902Z %317 = arith.select %316, %315, %c0_i32 : i32 2026-02-21T09:47:08.6696172Z %318 = ttg.memdesc_index %43[%317] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6696538Z ttg.local_store %236, %318 : 
tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6696895Z %319 = ttg.memdesc_index %44[%317] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6697248Z ttg.local_store %265, %319 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6697604Z %320 = ttg.memdesc_index %45[%317] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6697977Z ttg.local_store %294, %320 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> 2026-02-21T09:47:08.6698760Z scf.yield %314, %317, %arg8, %318, %arg10, %319, %arg12, %257, %arg14, %320, %arg16, %286 : tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32 2026-02-21T09:47:08.6699445Z } 2026-02-21T09:47:08.6699684Z %88 = ttg.local_load %87#2 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6700116Z %89 = arith.extf %88 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6700450Z %90 = arith.addi %42, %cst_2 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6700756Z %91 = tt.addptr %10, %90 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:47:08.6701060Z %92 = tt.load %91 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6701285Z %93 = arith.shli %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6701513Z %94 = arith.shrsi %93, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6701743Z %95 = arith.shrsi %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6702037Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6702364Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6702655Z %98 = tt.broadcast %96 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6702885Z %99 = arith.select %15, %98, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6703117Z %100 = tt.broadcast %97 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6703345Z %101 = arith.select %17, %100, %99 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6703574Z %102 = tt.reshape %101 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6703797Z %103 = arith.sitofp %102 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6704088Z %104 = ttg.convert_layout %103 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6704545Z %105 = tt.dot %89, %104, %87#0, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6705039Z %106 = ttg.local_load %87#4 : 
!ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6705466Z %107 = arith.extf %106 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6705750Z %108 = arith.muli %87#6, %c8192_i32 : i32 2026-02-21T09:47:08.6705927Z %109 = tt.splat %108 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6706171Z %110 = arith.addi %109, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6706478Z %111 = tt.addptr %10, %110 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6706788Z %112 = tt.load %111 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6707035Z %113 = arith.shli %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6707269Z %114 = arith.shrsi %113, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6707509Z %115 = arith.shrsi %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6707799Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6708136Z %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6708420Z %118 = tt.broadcast %116 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6708660Z %119 = arith.select %15, %118, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6708898Z %120 = tt.broadcast %117 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 
2026-02-21T09:47:08.6709133Z %121 = arith.select %17, %120, %119 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6709360Z %122 = tt.reshape %121 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6709582Z %123 = arith.sitofp %122 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6709871Z %124 = ttg.convert_layout %123 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6710346Z %125 = tt.dot %107, %124, %105, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6710855Z %126 = ttg.local_load %87#8 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6711289Z %127 = arith.extf %126 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6711578Z %128 = arith.muli %87#10, %c8192_i32 : i32 2026-02-21T09:47:08.6711758Z %129 = tt.splat %128 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6711983Z %130 = arith.addi %129, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6712292Z %131 = tt.addptr %10, %130 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6712601Z %132 = tt.load %131 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6712833Z %133 = arith.shli %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6713071Z %134 = arith.shrsi %133, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:47:08.6713306Z %135 = arith.shrsi %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6713593Z %136 = tt.expand_dims %134 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6713928Z %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6714233Z %138 = tt.broadcast %136 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6714470Z %139 = arith.select %15, %138, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6714704Z %140 = tt.broadcast %137 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6714952Z %141 = arith.select %17, %140, %139 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6715179Z %142 = tt.reshape %141 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6715397Z %143 = arith.sitofp %142 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6715691Z %144 = ttg.convert_layout %143 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6716148Z %145 = tt.dot %127, %144, %125, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6716639Z %146 = ttg.local_load %87#3 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6717072Z %147 = arith.extf %146 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:47:08.6717403Z %148 = arith.addi %42, %cst_1 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6717716Z %149 = tt.addptr %10, %148 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6718022Z %150 = tt.load %149 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6718267Z %151 = arith.shli %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6718501Z %152 = arith.shrsi %151, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6718748Z %153 = arith.shrsi %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6719034Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6719366Z %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6719642Z %156 = tt.broadcast %154 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6719880Z %157 = arith.select %15, %156, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6720113Z %158 = tt.broadcast %155 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6720345Z %159 = arith.select %17, %158, %157 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6720573Z %160 = tt.reshape %159 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6720789Z %161 = arith.sitofp %160 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6721082Z %162 = ttg.convert_layout %161 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:47:08.6721544Z %163 = tt.dot %147, %162, %145, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6722036Z %164 = ttg.local_load %87#5 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6722490Z %165 = arith.extf %164 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6722842Z %166 = arith.muli %87#7, %c8192_i32 : i32 2026-02-21T09:47:08.6723020Z %167 = tt.splat %166 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6723269Z %168 = arith.addi %167, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6723573Z %169 = tt.addptr %10, %168 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6723879Z %170 = tt.load %169 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6729094Z %171 = arith.shli %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6729338Z %172 = arith.shrsi %171, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6729575Z %173 = arith.shrsi %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6729865Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6730202Z %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6730486Z %176 = tt.broadcast %174 : 
tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6730723Z %177 = arith.select %15, %176, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6730962Z %178 = tt.broadcast %175 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6731244Z %179 = arith.select %17, %178, %177 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6731469Z %180 = tt.reshape %179 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6731688Z %181 = arith.sitofp %180 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6732004Z %182 = ttg.convert_layout %181 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6732469Z %183 = tt.dot %165, %182, %163, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6732967Z %184 = ttg.local_load %87#9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6733397Z %185 = arith.extf %184 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6733684Z %186 = arith.muli %87#11, %c8192_i32 : i32 2026-02-21T09:47:08.6733864Z %187 = tt.splat %186 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6734091Z %188 = arith.addi %187, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6734398Z %189 = tt.addptr %10, %188 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6734700Z %190 = tt.load %189 
: tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6734931Z %191 = arith.shli %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6735164Z %192 = arith.shrsi %191, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6735417Z %193 = arith.shrsi %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6735702Z %194 = tt.expand_dims %192 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6736034Z %195 = tt.expand_dims %193 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6736328Z %196 = tt.broadcast %194 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6736564Z %197 = arith.select %15, %196, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6736794Z %198 = tt.broadcast %195 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6737022Z %199 = arith.select %17, %198, %197 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6737246Z %200 = tt.reshape %199 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6737466Z %201 = arith.sitofp %200 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6737756Z %202 = ttg.convert_layout %201 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6738211Z %203 = tt.dot %185, %202, %183, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6738593Z ttg.local_dealloc %45 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 
2026-02-21T09:47:08.6738803Z ttg.local_dealloc %44 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6739003Z ttg.local_dealloc %43 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> 2026-02-21T09:47:08.6739277Z %204 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %203) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:47:08.6739326Z %228 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:47:08.6739422Z %229 = tt.splat %228 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6739532Z %230 = arith.addi %229, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.6739679Z %231 = tt.expand_dims %230 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.6739772Z %232 = tt.broadcast %231 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6739839Z %233 = arith.addi %41, %232 : tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6739945Z %234 = tt.addptr %9, %233 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T09:47:08.6740010Z %235 = tt.load %234 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.6740180Z %236 = ttg.convert_layout %235 : tensor<256x2xbf16, #blocked1> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6740378Z %237 = arith.extf %236 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6740427Z %238 = arith.muli %arg4, %c8192_i32 : i32 2026-02-21T09:47:08.6740524Z %239 = tt.splat %238 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6740618Z %240 = arith.addi %239, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6740795Z %241 = tt.addptr %10, %240 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 
1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6740890Z %242 = tt.load %241 : tensor<1x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6741004Z %243 = arith.shli %242, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6741101Z %244 = arith.shrsi %243, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6741200Z %245 = arith.shrsi %242, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.6741362Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6741507Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked> 2026-02-21T09:47:08.6741602Z %248 = tt.broadcast %246 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6741703Z %249 = arith.select %15, %248, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6741796Z %250 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6741897Z %251 = arith.select %17, %250, %249 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked> 2026-02-21T09:47:08.6741985Z %252 = tt.reshape %251 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2> 2026-02-21T09:47:08.6742076Z %253 = arith.sitofp %252 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2> 2026-02-21T09:47:08.6742239Z %254 = ttg.convert_layout %253 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.6742503Z %255 = tt.dot %237, %254, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6742553Z scf.yield %255 : tensor<256x32xf32, #mma> 2026-02-21T09:47:08.6742599Z } {tt.num_stages = 1 : i32} 2026-02-21T09:47:08.6742699Z %205 = arith.truncf %204 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:47:08.6742742Z %206 = arith.extsi %33 : i32 to i64 2026-02-21T09:47:08.6742806Z %207 = arith.extsi %36 : i32 to i64 2026-02-21T09:47:08.6742897Z %208 = tt.splat %206 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6742982Z %209 = arith.addi %208, %19 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.6743126Z %210 = tt.expand_dims %209 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6743191Z %211 = arith.muli %210, %cst_12 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6743274Z %212 = tt.broadcast %211 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6743357Z %213 = tt.splat %207 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.6743442Z %214 = arith.addi %213, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.6743579Z %215 = tt.expand_dims %214 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6743662Z %216 = tt.broadcast %215 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6743724Z %217 = arith.addi %212, %216 : tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6743819Z %218 = tt.addptr %18, %217 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:47:08.6743887Z %219 = arith.cmpi sge, %210, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6743954Z %220 = arith.cmpi slt, %210, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:47:08.6744014Z %221 = arith.andi %219, %220 : tensor<256x1xi1, #mma> 2026-02-21T09:47:08.6744094Z %222 = tt.broadcast %221 : tensor<256x1xi1, #mma> 
-> tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6744161Z %223 = arith.cmpi sge, %215, %cst_15 : tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6744245Z %224 = arith.cmpi slt, %215, %cst_16 : tensor<1x32xi64, #mma> 2026-02-21T09:47:08.6744300Z %225 = arith.andi %223, %224 : tensor<1x32xi1, #mma> 2026-02-21T09:47:08.6744379Z %226 = tt.broadcast %225 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6744438Z %227 = arith.andi %222, %226 : tensor<256x32xi1, #mma> 2026-02-21T09:47:08.6744517Z tt.store %218, %205, %227 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:47:08.6744558Z } {tt.num_stages = 1 : i32} 2026-02-21T09:47:08.6744595Z tt.return 2026-02-21T09:47:08.6744626Z } 2026-02-21T09:47:08.6744660Z } 2026-02-21T09:47:08.6744665Z 2026-02-21T09:47:08.6744697Z {-# 2026-02-21T09:47:08.6744737Z external_resources: { 2026-02-21T09:47:08.6744775Z mlir_reproducer: { 2026-02-21T09:47:08.6745712Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:47:08.6745754Z disable_threading: false, 2026-02-21T09:47:08.6745792Z verify_each: true 2026-02-21T09:47:08.6745824Z } 2026-02-21T09:47:08.6745854Z } 2026-02-21T09:47:08.6745884Z #-} 2026-02-21T09:47:08.6746120Z /tmp/torchinductor_root/gt/cgtsosf4i4p64k42w2lgukhe2w75x3xfgm3sybvafk5ra2aegofb.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:47:08.6746560Z 
/tmp/torchinductor_root/gt/cgtsosf4i4p64k42w2lgukhe2w75x3xfgm3sybvafk5ra2aegofb.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:47:08.6746672Z [159s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:47:08.6747312Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 256, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:47:08.6747367Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:47:08.6747447Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:47:08.8419752Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:47:08.8422669Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:47:08.8423587Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:47:08.8424410Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:47:08.8425174Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:47:08.8426036Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:47:08.8427501Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:47:08.8428576Z %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #mma> 2026-02-21T09:47:08.8429023Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:47:08.8429402Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:47:08.8429734Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T09:47:08.8430064Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:47:08.8430464Z %cst_0 = arith.constant dense<0> : tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8430871Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:47:08.8431018Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:47:08.8431132Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:47:08.8431246Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:47:08.8431358Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:47:08.8431505Z %cst_1 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T09:47:08.8431723Z %cst_2 = arith.constant dense<4> : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8431937Z 
%cst_3 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.8432103Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.8432271Z %cst_5 = arith.constant dense<8192> : tensor<32x1xi32, #mma> 2026-02-21T09:47:08.8432410Z %0 = tt.get_program_id x : i32 2026-02-21T09:47:08.8432652Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.8432957Z %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.8433225Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.8433510Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.8433765Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.8434021Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.8434256Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8434556Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:47:08.8434970Z %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:47:08.8435389Z %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.8435637Z %11 = arith.cmpi eq, %10, %cst_3 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.8435832Z %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, 
#blocked> 2026-02-21T09:47:08.8436025Z %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:08.8436209Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, #blocked> 2026-02-21T09:47:08.8436412Z %15 = tt.splat %arg2 : !tt.ptr -> tensor<32x16x!tt.ptr, #mma> 2026-02-21T09:47:08.8436591Z scf.for %arg3 = %0 to %c262144_i32 step %c2432_i32 : i32 { 2026-02-21T09:47:08.8436740Z %16 = arith.divsi %arg3, %c8192_i32 : i32 2026-02-21T09:47:08.8436861Z %17 = arith.muli %16, %c16_i32 : i32 2026-02-21T09:47:08.8436982Z %18 = arith.subi %c512_i32, %17 : i32 2026-02-21T09:47:08.8437095Z %19 = arith.minsi %18, %c16_i32 : i32 2026-02-21T09:47:08.8437213Z %20 = arith.remsi %arg3, %c8192_i32 : i32 2026-02-21T09:47:08.8437353Z %21 = arith.remsi %20, %19 : i32 2026-02-21T09:47:08.8437463Z %22 = arith.addi %17, %21 : i32 2026-02-21T09:47:08.8437573Z %23 = arith.divsi %20, %19 : i32 2026-02-21T09:47:08.8437682Z %24 = arith.muli %22, %c16_i32 : i32 2026-02-21T09:47:08.8437887Z %25 = tt.splat %24 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.8438150Z %26 = tt.splat %24 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.8438397Z %27 = arith.addi %25, %1 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:08.8438645Z %28 = arith.addi %26, %2 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:08.8438804Z %29 = arith.muli %23, %c32_i32 : i32 2026-02-21T09:47:08.8438968Z %30 = tt.splat %29 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.8439176Z %31 = tt.splat %29 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:08.8439382Z %32 = arith.addi %30, %3 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:08.8439586Z %33 = arith.addi %31, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent 
= #mma}>> 2026-02-21T09:47:08.8439847Z %34 = tt.expand_dims %32 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:47:08.8440091Z %35 = arith.muli %34, %cst_1 : tensor<32x1xi32, #blocked1> 2026-02-21T09:47:08.8440276Z %36 = tt.broadcast %35 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:47:08.8440622Z %37 = tt.expand_dims %27 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8441021Z %38 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x16xf32, #mma>) : i32 { 2026-02-21T09:47:08.8441231Z %47 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:47:08.8441398Z %48 = tt.splat %47 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.8441630Z %49 = arith.addi %48, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.8441900Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.8442173Z %51 = tt.broadcast %50 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:47:08.8442359Z %52 = arith.addi %36, %51 : tensor<32x2xi32, #blocked1> 2026-02-21T09:47:08.8442553Z %53 = tt.addptr %6, %52 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:47:08.8442794Z %54 = tt.load %53 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.8443053Z %55 = ttg.convert_layout %54 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.8443454Z %56 = arith.extf %55 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.8443734Z %57 = arith.muli 
%arg4, %c8192_i32 : i32 2026-02-21T09:47:08.8443906Z %58 = tt.splat %57 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8444124Z %59 = arith.addi %58, %37 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8444424Z %60 = tt.addptr %7, %59 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8444723Z %61 = tt.load %60 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8444968Z %62 = arith.shli %61, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8445196Z %63 = arith.shrsi %62, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8445427Z %64 = arith.shrsi %61, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8445722Z %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T09:47:08.8446044Z %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T09:47:08.8446314Z %67 = tt.broadcast %65 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8446540Z %68 = arith.select %12, %67, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8446768Z %69 = tt.broadcast %66 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8446989Z %70 = arith.select %14, %69, %68 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8447208Z %71 = tt.reshape %70 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T09:47:08.8447420Z %72 = arith.sitofp %71 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T09:47:08.8447702Z %73 = ttg.convert_layout %72 : tensor<2x16xf32, 
#blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.8448156Z %74 = tt.dot %56, %73, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma> 2026-02-21T09:47:08.8448495Z %75 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:47:08.8448615Z %76 = arith.muli %75, %c2_i32 : i32 2026-02-21T09:47:08.8448815Z %77 = tt.splat %76 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.8449027Z %78 = arith.addi %77, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:08.8449313Z %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:47:08.8449579Z %80 = tt.broadcast %79 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1> 2026-02-21T09:47:08.8449763Z %81 = arith.addi %36, %80 : tensor<32x2xi32, #blocked1> 2026-02-21T09:47:08.8449954Z %82 = tt.addptr %6, %81 : tensor<32x2x!tt.ptr, #blocked1>, tensor<32x2xi32, #blocked1> 2026-02-21T09:47:08.8450152Z %83 = tt.load %82 : tensor<32x2x!tt.ptr, #blocked1> 2026-02-21T09:47:08.8450403Z %84 = ttg.convert_layout %83 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.8450795Z %85 = arith.extf %84 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.8451068Z %86 = arith.muli %75, %c8192_i32 : i32 2026-02-21T09:47:08.8451239Z %87 = tt.splat %86 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8451457Z %88 = arith.addi %87, %37 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8451753Z %89 = tt.addptr %7, %88 : tensor<1x16x!tt.ptr, #ttg.slice<{dim 
= 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8452049Z %90 = tt.load %89 : tensor<1x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8452270Z %91 = arith.shli %90, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8452498Z %92 = arith.shrsi %91, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8452743Z %93 = arith.shrsi %90, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:08.8453021Z %94 = tt.expand_dims %92 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T09:47:08.8453362Z %95 = tt.expand_dims %93 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked> 2026-02-21T09:47:08.8453636Z %96 = tt.broadcast %94 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8453863Z %97 = arith.select %12, %96, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8454088Z %98 = tt.broadcast %95 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8454307Z %99 = arith.select %14, %98, %97 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked> 2026-02-21T09:47:08.8454528Z %100 = tt.reshape %99 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2> 2026-02-21T09:47:08.8454745Z %101 = arith.sitofp %100 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2> 2026-02-21T09:47:08.8455034Z %102 = ttg.convert_layout %101 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:08.8455486Z %103 = tt.dot %85, %102, %74, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma> 
2026-02-21T09:47:08.8455822Z scf.yield %103 : tensor<32x16xf32, #mma> 2026-02-21T09:47:08.8455983Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:47:08.8456178Z %39 = arith.truncf %38 : tensor<32x16xf32, #mma> to tensor<32x16xbf16, #mma> 2026-02-21T09:47:08.8456445Z %40 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma> 2026-02-21T09:47:08.8456673Z %41 = arith.muli %40, %cst_5 : tensor<32x1xi32, #mma> 2026-02-21T09:47:08.8456894Z %42 = tt.expand_dims %28 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma> 2026-02-21T09:47:08.8457155Z %43 = tt.broadcast %41 : tensor<32x1xi32, #mma> -> tensor<32x16xi32, #mma> 2026-02-21T09:47:08.8457344Z %44 = tt.broadcast %42 : tensor<1x16xi32, #mma> -> tensor<32x16xi32, #mma> 2026-02-21T09:47:08.8457512Z %45 = arith.addi %43, %44 : tensor<32x16xi32, #mma> 2026-02-21T09:47:08.8457685Z %46 = tt.addptr %15, %45 : tensor<32x16x!tt.ptr, #mma>, tensor<32x16xi32, #mma> 2026-02-21T09:47:08.8457867Z tt.store %46, %39 : tensor<32x16x!tt.ptr, #mma> 2026-02-21T09:47:08.8457988Z } 2026-02-21T09:47:08.8458061Z tt.return 2026-02-21T09:47:08.8458138Z } 2026-02-21T09:47:08.8458206Z } 2026-02-21T09:47:08.8458248Z 2026-02-21T09:47:08.8458279Z {-# 2026-02-21T09:47:08.8458357Z external_resources: { 2026-02-21T09:47:08.8458456Z mlir_reproducer: { 2026-02-21T09:47:08.8459451Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:47:08.8460431Z disable_threading: false, 2026-02-21T09:47:08.8460534Z verify_each: true 2026-02-21T09:47:08.8460619Z } 2026-02-21T09:47:08.8460687Z } 2026-02-21T09:47:08.8460753Z #-} 2026-02-21T09:47:08.8461043Z /tmp/torchinductor_root/cl/cclt7caafp2qbph4tyfhx3ejkoceexr4wcdrgb56w7si5kzmws65.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:47:08.8461713Z /tmp/torchinductor_root/cl/cclt7caafp2qbph4tyfhx3ejkoceexr4wcdrgb56w7si5kzmws65.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:47:08.8462277Z [159s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:47:08.8463049Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:47:08.8463750Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:47:08.8463913Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:47:16.3701396Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:47:16.3703833Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:47:16.3704448Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:47:16.3704959Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:47:16.3706005Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:47:16.3706441Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:16.3706837Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:16.3707246Z #smem = #ttg.shared_memory 2026-02-21T09:47:16.3707647Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:47:16.3708407Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:47:16.3709075Z %cst = arith.constant dense<0.000000e+00> : tensor<32x64xf32, #mma> 2026-02-21T09:47:16.3709335Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T09:47:16.3709509Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:47:16.3709689Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:47:16.3709848Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:47:16.3710060Z %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:47:16.3710270Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:47:16.3710449Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:47:16.3710674Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:47:16.3710842Z %c1216_i32 = arith.constant 1216 : i32 
2026-02-21T09:47:16.3711063Z %cst_1 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T09:47:16.3711382Z %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3711750Z %cst_3 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3712058Z %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:16.3712308Z %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:16.3712675Z %cst_6 = arith.constant dense<8192> : tensor<32x1xi32, #mma> 2026-02-21T09:47:16.3712878Z %0 = tt.get_program_id x : i32 2026-02-21T09:47:16.3713170Z %1 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:16.3713560Z %2 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:16.3714099Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:16.3714533Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:16.3714968Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:16.3715407Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:16.3715764Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<32x4x!tt.ptr, #blocked1> 2026-02-21T09:47:16.3716102Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3716540Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:47:16.3717130Z %10 
= tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:47:16.3717697Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:16.3718059Z %12 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:16.3718366Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:47:16.3718655Z %14 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:16.3718923Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:47:16.3719260Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<32x64x!tt.ptr, #mma> 2026-02-21T09:47:16.3719469Z scf.for %arg3 = %0 to %c65536_i32 step %c1216_i32 : i32 { 2026-02-21T09:47:16.3719634Z %17 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:47:16.3719779Z %18 = arith.muli %17, %c2_i32 : i32 2026-02-21T09:47:16.3719912Z %19 = arith.subi %c512_i32, %18 : i32 2026-02-21T09:47:16.3720055Z %20 = arith.minsi %19, %c2_i32 : i32 2026-02-21T09:47:16.3720195Z %21 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:47:16.3720335Z %22 = arith.remsi %21, %20 : i32 2026-02-21T09:47:16.3720466Z %23 = arith.addi %18, %22 : i32 2026-02-21T09:47:16.3720595Z %24 = arith.divsi %21, %20 : i32 2026-02-21T09:47:16.3720726Z %25 = arith.muli %23, %c32_i32 : i32 2026-02-21T09:47:16.3720916Z %26 = tt.splat %25 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:16.3721163Z %27 = tt.splat %25 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:16.3721399Z %28 = arith.addi %26, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:16.3721639Z %29 = arith.addi %27, %2 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:47:16.3721829Z %30 = arith.muli %24, %c64_i32 : i32 2026-02-21T09:47:16.3722056Z %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:16.3722340Z %32 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:16.3722700Z %33 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:16.3723010Z %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:16.3723308Z %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:47:16.3723588Z %36 = arith.muli %35, %cst_1 : tensor<32x1xi32, #blocked1> 2026-02-21T09:47:16.3723824Z %37 = tt.broadcast %36 : tensor<32x1xi32, #blocked1> -> tensor<32x4xi32, #blocked1> 2026-02-21T09:47:16.3724207Z %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3724689Z %39 = tt.broadcast %38 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3725073Z %40 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x64xf32, #mma>) : i32 { 2026-02-21T09:47:16.3725418Z %49 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:16.3725754Z %50 = arith.addi %49, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:16.3725998Z %51 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:47:16.3726190Z %52 = tt.splat %51 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:16.3726441Z %53 = 
arith.addi %52, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:16.3726749Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:47:16.3727064Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<32x4xi32, #blocked1> 2026-02-21T09:47:16.3727301Z %56 = arith.addi %37, %55 : tensor<32x4xi32, #blocked1> 2026-02-21T09:47:16.3727524Z %57 = tt.addptr %7, %56 : tensor<32x4x!tt.ptr, #blocked1>, tensor<32x4xi32, #blocked1> 2026-02-21T09:47:16.3727782Z %58 = tt.load %57 : tensor<32x4x!tt.ptr, #blocked1> 2026-02-21T09:47:16.3728030Z %59 = ttg.local_alloc %58 : (tensor<32x4xbf16, #blocked1>) -> !ttg.memdesc<32x4xbf16, #shared, #smem> 2026-02-21T09:47:16.3728404Z %60 = ttg.local_load %59 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:16.3728866Z %61 = arith.extf %60 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:16.3729384Z %62 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3729757Z %63 = arith.muli %62, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3730064Z %64 = tt.broadcast %63 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3730370Z %65 = arith.addi %64, %39 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3730679Z %66 = tt.addptr %8, %65 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3730986Z %67 = tt.load %66 : 
tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3731221Z %68 = arith.shli %67, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3731459Z %69 = arith.shrsi %68, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3731713Z %70 = arith.shrsi %67, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:16.3732001Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:47:16.3732330Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:47:16.3732628Z %73 = tt.broadcast %71 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:47:16.3732894Z %74 = arith.select %13, %73, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:47:16.3733133Z %75 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:47:16.3733365Z %76 = arith.select %15, %75, %74 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:47:16.3733591Z %77 = tt.reshape %76 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:47:16.3733818Z %78 = arith.sitofp %77 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:47:16.3734065Z %79 = ttg.local_alloc %78 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:47:16.3734388Z %80 = ttg.local_load %79 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:16.3734857Z %81 = tt.dot %61, %80, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x64xf32, #mma> 
2026-02-21T09:47:16.3735208Z scf.yield %81 : tensor<32x64xf32, #mma> 2026-02-21T09:47:16.3735344Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T09:47:16.3735508Z %41 = arith.truncf %40 : tensor<32x64xf32, #mma> to tensor<32x64xbf16, #mma> 2026-02-21T09:47:16.3735787Z %42 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma> 2026-02-21T09:47:16.3736026Z %43 = arith.muli %42, %cst_6 : tensor<32x1xi32, #mma> 2026-02-21T09:47:16.3736272Z %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:47:16.3736531Z %45 = tt.broadcast %43 : tensor<32x1xi32, #mma> -> tensor<32x64xi32, #mma> 2026-02-21T09:47:16.3736726Z %46 = tt.broadcast %44 : tensor<1x64xi32, #mma> -> tensor<32x64xi32, #mma> 2026-02-21T09:47:16.3736906Z %47 = arith.addi %45, %46 : tensor<32x64xi32, #mma> 2026-02-21T09:47:16.3737090Z %48 = tt.addptr %16, %47 : tensor<32x64x!tt.ptr, #mma>, tensor<32x64xi32, #mma> 2026-02-21T09:47:16.3737278Z tt.store %48, %41 : tensor<32x64x!tt.ptr, #mma> 2026-02-21T09:47:16.3737449Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:47:16.3737589Z tt.return 2026-02-21T09:47:16.3737681Z } 2026-02-21T09:47:16.3737763Z } 2026-02-21T09:47:16.3737813Z 2026-02-21T09:47:16.3737846Z {-# 2026-02-21T09:47:16.3737930Z external_resources: { 2026-02-21T09:47:16.3738039Z mlir_reproducer: { 2026-02-21T09:47:16.3739038Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:47:16.3740031Z disable_threading: false, 2026-02-21T09:47:16.3740143Z verify_each: true 2026-02-21T09:47:16.3740241Z } 2026-02-21T09:47:16.3740320Z } 2026-02-21T09:47:16.3740440Z #-} 2026-02-21T09:47:16.3740724Z /tmp/torchinductor_root/bh/cbhtwx7xaagwkpticpyxiicu3kro4ylitkdddkylogh3car5inkw.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:47:16.3741451Z /tmp/torchinductor_root/bh/cbhtwx7xaagwkpticpyxiicu3kro4ylitkdddkylogh3car5inkw.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:47:16.3742021Z [167s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:47:16.3742800Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 32, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[2, 0], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:47:16.3743503Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:47:16.3743675Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:47:37.4558991Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:47:37.4563272Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:47:37.4567840Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:47:37.4569244Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:47:37.4570007Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [16, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:47:37.4570658Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:47:37.4571382Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:37.4571837Z #smem = #ttg.shared_memory 2026-02-21T09:47:37.4572428Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:47:37.4573572Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:47:37.4574533Z %cst = arith.constant dense<0.000000e+00> : tensor<4096x16xf32, #mma> 2026-02-21T09:47:37.4574931Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:47:37.4575212Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:47:37.4575492Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:47:37.4575863Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:47:37.4576221Z %cst_0 = arith.constant dense<0> : tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4576577Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:47:37.4576848Z %c12_i32 = arith.constant 12 : i32 2026-02-21T09:47:37.4577084Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:47:37.4577361Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:47:37.4577550Z 
%c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:47:37.4577741Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:47:37.4578041Z %cst_1 = arith.constant dense<0> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4578478Z %cst_2 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4579015Z %cst_3 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4579388Z %cst_4 = arith.constant dense<1024> : tensor<4096x1xi32, #blocked1> 2026-02-21T09:47:37.4579747Z %cst_5 = arith.constant dense<4> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4580092Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:37.4580467Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:37.4580808Z %cst_8 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4581228Z %cst_9 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4581644Z %cst_10 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4581992Z %cst_11 = arith.constant dense<0> : tensor<1x16xi64, #mma> 2026-02-21T09:47:37.4582277Z %cst_12 = arith.constant dense<8192> : tensor<1x16xi64, #mma> 2026-02-21T09:47:37.4582568Z %cst_13 = arith.constant dense<8192> : tensor<4096x1xi64, #mma> 2026-02-21T09:47:37.4582847Z %cst_14 = arith.constant dense<0> : tensor<4096x1xi64, #mma> 2026-02-21T09:47:37.4583132Z %cst_15 = arith.constant dense<16384> : tensor<4096x1xi64, #mma> 2026-02-21T09:47:37.4583385Z %0 = tt.get_program_id x : i32 2026-02-21T09:47:37.4583570Z %1 = arith.divsi %0, %c2048_i32 : i32 2026-02-21T09:47:37.4583767Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:47:37.4583957Z %3 = arith.subi %c4_i32, %2 : i32 2026-02-21T09:47:37.4584135Z %4 = 
arith.minsi %3, %c4_i32 : i32 2026-02-21T09:47:37.4584327Z %5 = arith.remsi %0, %c2048_i32 : i32 2026-02-21T09:47:37.4584517Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:47:37.4584695Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:47:37.4584875Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:47:37.4585051Z %9 = arith.muli %7, %c4096_i32 : i32 2026-02-21T09:47:37.4585437Z %10 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:37.4585915Z %11 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:37.4586369Z %12 = tt.splat %9 : i32 -> tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:37.4586747Z %13 = arith.addi %12, %10 : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:37.4587044Z %14 = arith.muli %8, %c16_i32 : i32 2026-02-21T09:47:37.4587366Z %15 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4587855Z %16 = tt.expand_dims %13 {axis = 1 : i32} : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4096x1xi32, #blocked1> 2026-02-21T09:47:37.4588197Z %17 = arith.muli %16, %cst_4 : tensor<4096x1xi32, #blocked1> 2026-02-21T09:47:37.4588445Z %18 = tt.broadcast %17 : tensor<4096x1xi32, #blocked1> -> tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4588727Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<4096x8x!tt.ptr, #blocked1> 2026-02-21T09:47:37.4588934Z %20 = arith.extsi %14 : i32 to i64 2026-02-21T09:47:37.4589170Z %21 = tt.splat %arg1 : !tt.ptr -> tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4589555Z %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4590088Z %23 = arith.extsi %22 : tensor<4xi32, #ttg.slice<{dim = 1, parent = 
#ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4590591Z %24 = tt.splat %20 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4591018Z %25 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4591553Z %26 = arith.extsi %25 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4592069Z %27 = arith.addi %24, %26 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4592549Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4593076Z %29 = tt.broadcast %28 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4593463Z %30 = arith.cmpi sge, %28, %cst_3 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4593764Z %31 = arith.cmpi slt, %28, %cst_2 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4594047Z %32 = arith.andi %30, %31 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4594419Z %33 = tt.broadcast %32 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4594862Z %34 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:47:37.4595374Z %35 = tt.expand_dims %34 
{axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:47:37.4595878Z %36 = tt.expand_dims %35 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:37.4596211Z %37 = arith.cmpi eq, %36, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:37.4596452Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked> 2026-02-21T09:47:37.4596712Z %39 = arith.cmpi eq, %36, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:37.4596943Z %40 = tt.broadcast %39 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked> 2026-02-21T09:47:37.4597283Z %41 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c12_i32 iter_args(%arg4 = %cst) -> (tensor<4096x16xf32, #mma>) : i32 { 2026-02-21T09:47:37.4597552Z %69 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:47:37.4597751Z %70 = tt.splat %69 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4597970Z %71 = arith.addi %70, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4598248Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:47:37.4598524Z %73 = tt.broadcast %72 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4598717Z %74 = arith.addi %18, %73 : tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4598924Z %75 = tt.addptr %19, %74 : tensor<4096x8x!tt.ptr, #blocked1>, tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4599136Z %76 = tt.load %75 : tensor<4096x8x!tt.ptr, #blocked1> 2026-02-21T09:47:37.4599361Z %77 = ttg.local_alloc %76 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem> 2026-02-21T09:47:37.4599696Z %78 = ttg.local_load %77 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> 
tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4600110Z %79 = arith.extf %78 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4600401Z %80 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:47:37.4600629Z %81 = tt.splat %80 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4600925Z %82 = arith.addi %81, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4601328Z %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4601678Z %84 = arith.muli %83, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4601980Z %85 = tt.broadcast %84 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4602276Z %86 = arith.addi %85, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4602654Z %87 = tt.addptr %21, %86 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4602971Z %88 = arith.cmpi sge, %83, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4603210Z %89 = arith.cmpi slt, %83, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4603440Z %90 = arith.andi %88, %89 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4603733Z %91 = tt.broadcast %90 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4604025Z 
%92 = arith.andi %91, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4604263Z %93 = tt.load %87, %92, %cst_1 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4604523Z %94 = arith.shli %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4604758Z %95 = arith.shrsi %94, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4605014Z %96 = arith.shrsi %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4605296Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4605628Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4605904Z %99 = tt.broadcast %97 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4606141Z %100 = arith.select %38, %99, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4606380Z %101 = tt.broadcast %98 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4606610Z %102 = arith.select %40, %101, %100 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4606843Z %103 = tt.reshape %102 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2> 2026-02-21T09:47:37.4607066Z %104 = arith.sitofp %103 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2> 2026-02-21T09:47:37.4607318Z %105 = ttg.local_alloc %104 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem> 2026-02-21T09:47:37.4607646Z %106 = ttg.local_load %105 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4608131Z %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<4096x8xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma> 2026-02-21T09:47:37.4608483Z %108 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:47:37.4608638Z %109 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:47:37.4608807Z %110 = tt.splat %109 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4609036Z %111 = arith.addi %110, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4609324Z %112 = tt.expand_dims %111 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:47:37.4609601Z %113 = tt.broadcast %112 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4609795Z %114 = arith.addi %18, %113 : tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4610001Z %115 = tt.addptr %19, %114 : tensor<4096x8x!tt.ptr, #blocked1>, tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4610214Z %116 = tt.load %115 : tensor<4096x8x!tt.ptr, #blocked1> 2026-02-21T09:47:37.4610437Z %117 = ttg.local_alloc %116 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem> 2026-02-21T09:47:37.4610772Z %118 = ttg.local_load %117 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4611178Z %119 = arith.extf %118 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4611462Z %120 = arith.extsi %108 : i32 to i64 2026-02-21T09:47:37.4611670Z %121 = tt.splat %120 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4611962Z %122 = arith.addi %121, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4612362Z 
%123 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4612719Z %124 = arith.muli %123, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4613023Z %125 = tt.broadcast %124 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4613339Z %126 = arith.addi %125, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4613646Z %127 = tt.addptr %21, %126 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4613977Z %128 = arith.cmpi sge, %123, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4614223Z %129 = arith.cmpi slt, %123, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4614475Z %130 = arith.andi %128, %129 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4614772Z %131 = tt.broadcast %130 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4615092Z %132 = arith.andi %131, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4615340Z %133 = tt.load %127, %132, %cst_1 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4615585Z %134 = arith.shli %133, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4615820Z %135 = arith.shrsi %134, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4616062Z %136 = arith.shrsi %133, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4616346Z %137 = tt.expand_dims %135 
{axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4616706Z %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4616987Z %139 = tt.broadcast %137 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4617228Z %140 = arith.select %38, %139, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4617481Z %141 = tt.broadcast %138 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4617710Z %142 = arith.select %40, %141, %140 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4617943Z %143 = tt.reshape %142 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2> 2026-02-21T09:47:37.4618163Z %144 = arith.sitofp %143 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2> 2026-02-21T09:47:37.4618421Z %145 = ttg.local_alloc %144 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem> 2026-02-21T09:47:37.4618751Z %146 = ttg.local_load %145 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4619221Z %147 = tt.dot %119, %146, %107, inputPrecision = tf32 : tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma> 2026-02-21T09:47:37.4619596Z %148 = arith.addi %arg3, %c8_i32 : i32 2026-02-21T09:47:37.4619732Z %149 = arith.muli %148, %c2_i32 : i32 2026-02-21T09:47:37.4619924Z %150 = tt.splat %149 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4620152Z %151 = arith.addi %150, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4620441Z %152 = tt.expand_dims %151 {axis = 0 : i32} : tensor<8xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:47:37.4620750Z %153 = tt.broadcast %152 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4620965Z %154 = arith.addi %18, %153 : tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4621191Z %155 = tt.addptr %19, %154 : tensor<4096x8x!tt.ptr, #blocked1>, tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4621414Z %156 = tt.load %155 : tensor<4096x8x!tt.ptr, #blocked1> 2026-02-21T09:47:37.4621649Z %157 = ttg.local_alloc %156 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem> 2026-02-21T09:47:37.4621987Z %158 = ttg.local_load %157 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4622402Z %159 = arith.extf %158 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4622713Z %160 = arith.extsi %148 : i32 to i64 2026-02-21T09:47:37.4622940Z %161 = tt.splat %160 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4623242Z %162 = arith.addi %161, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4623635Z %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4623991Z %164 = arith.muli %163, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4624300Z %165 = tt.broadcast %164 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4624608Z %166 = arith.addi %165, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:47:37.4624934Z %167 = tt.addptr %21, %166 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4625254Z %168 = arith.cmpi sge, %163, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4625504Z %169 = arith.cmpi slt, %163, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4625753Z %170 = arith.andi %168, %169 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4626054Z %171 = tt.broadcast %170 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4626347Z %172 = arith.andi %171, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4626593Z %173 = tt.load %167, %172, %cst_1 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4626842Z %174 = arith.shli %173, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4627073Z %175 = arith.shrsi %174, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4627312Z %176 = arith.shrsi %173, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4627598Z %177 = tt.expand_dims %175 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4627937Z %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4628218Z %179 = tt.broadcast %177 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4628453Z %180 = arith.select %38, %179, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4628706Z %181 = tt.broadcast %178 : tensor<4x1x16xi8, #blocked> -> 
tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4628937Z %182 = arith.select %40, %181, %180 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4629169Z %183 = tt.reshape %182 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2> 2026-02-21T09:47:37.4629413Z %184 = arith.sitofp %183 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2> 2026-02-21T09:47:37.4629663Z %185 = ttg.local_alloc %184 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem> 2026-02-21T09:47:37.4629987Z %186 = ttg.local_load %185 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4630459Z %187 = tt.dot %159, %186, %147, inputPrecision = tf32 : tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma> 2026-02-21T09:47:37.4630810Z scf.yield %187 : tensor<4096x16xf32, #mma> 2026-02-21T09:47:37.4630949Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:47:37.4631161Z %42 = scf.for %arg3 = %c504_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %41) -> (tensor<4096x16xf32, #mma>) : i32 { 2026-02-21T09:47:37.4631380Z %69 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:47:37.4631551Z %70 = tt.splat %69 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4631770Z %71 = arith.addi %70, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:37.4632046Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:47:37.4632317Z %73 = tt.broadcast %72 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4632516Z %74 = arith.addi %18, %73 : tensor<4096x8xi32, #blocked1> 2026-02-21T09:47:37.4632726Z %75 = tt.addptr %19, %74 : tensor<4096x8x!tt.ptr, #blocked1>, tensor<4096x8xi32, 
#blocked1> 2026-02-21T09:47:37.4632960Z %76 = tt.load %75 : tensor<4096x8x!tt.ptr, #blocked1> 2026-02-21T09:47:37.4633184Z %77 = ttg.local_alloc %76 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem> 2026-02-21T09:47:37.4633516Z %78 = ttg.local_load %77 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4633941Z %79 = arith.extf %78 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4634227Z %80 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:47:37.4634433Z %81 = tt.splat %80 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4634728Z %82 = arith.addi %81, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:37.4635112Z %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4635464Z %84 = arith.muli %83, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4635766Z %85 = tt.broadcast %84 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4636061Z %86 = arith.addi %85, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4636371Z %87 = tt.addptr %21, %86 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4636681Z %88 = arith.cmpi sge, %83, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4636941Z %89 = arith.cmpi slt, %83, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:47:37.4637170Z %90 = arith.andi %88, %89 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4637479Z %91 = tt.broadcast %90 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4637775Z %92 = arith.andi %91, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4638012Z %93 = tt.load %87, %92, %cst_1 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4638254Z %94 = arith.shli %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4638485Z %95 = arith.shrsi %94, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4638713Z %96 = arith.shrsi %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:37.4638996Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4639328Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:37.4639603Z %99 = tt.broadcast %97 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4639840Z %100 = arith.select %38, %99, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4640076Z %101 = tt.broadcast %98 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4640309Z %102 = arith.select %40, %101, %100 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:37.4640541Z %103 = tt.reshape %102 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2> 2026-02-21T09:47:37.4640762Z %104 = arith.sitofp %103 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2> 2026-02-21T09:47:37.4641030Z %105 = ttg.local_alloc %104 : 
(tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem> 2026-02-21T09:47:37.4641355Z %106 = ttg.local_load %105 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:37.4641852Z %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma> 2026-02-21T09:47:37.4642210Z scf.yield %107 : tensor<4096x16xf32, #mma> 2026-02-21T09:47:37.4642343Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:47:37.4642520Z %43 = arith.truncf %42 : tensor<4096x16xf32, #mma> to tensor<4096x16xbf16, #mma> 2026-02-21T09:47:37.4642733Z %44 = arith.extsi %9 : i32 to i64 2026-02-21T09:47:37.4642897Z %45 = tt.splat %arg2 : !tt.ptr -> tensor<4096x16x!tt.ptr, #mma> 2026-02-21T09:47:37.4643107Z %46 = tt.splat %44 : i64 -> tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:37.4643394Z %47 = arith.extsi %11 : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:37.4643678Z %48 = arith.addi %46, %47 : tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:37.4643944Z %49 = tt.expand_dims %48 {axis = 1 : i32} : tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<4096x1xi64, #mma> 2026-02-21T09:47:37.4644190Z %50 = arith.muli %49, %cst_13 : tensor<4096x1xi64, #mma> 2026-02-21T09:47:37.4644371Z %51 = tt.broadcast %50 : tensor<4096x1xi64, #mma> -> tensor<4096x16xi64, #mma> 2026-02-21T09:47:37.4644582Z %52 = tt.splat %20 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:37.4644839Z %53 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:37.4645138Z %54 = arith.extsi %53 : tensor<16xi32, #ttg.slice<{dim = 0, parent = 
#mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:37.4645427Z %55 = arith.addi %52, %54 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:37.4645680Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma> 2026-02-21T09:47:37.4645939Z %57 = tt.broadcast %56 : tensor<1x16xi64, #mma> -> tensor<4096x16xi64, #mma> 2026-02-21T09:47:37.4646121Z %58 = arith.addi %51, %57 : tensor<4096x16xi64, #mma> 2026-02-21T09:47:37.4646308Z %59 = tt.addptr %45, %58 : tensor<4096x16x!tt.ptr, #mma>, tensor<4096x16xi64, #mma> 2026-02-21T09:47:37.4646513Z %60 = arith.cmpi sge, %49, %cst_14 : tensor<4096x1xi64, #mma> 2026-02-21T09:47:37.4646679Z %61 = arith.cmpi slt, %49, %cst_15 : tensor<4096x1xi64, #mma> 2026-02-21T09:47:37.4646843Z %62 = arith.andi %60, %61 : tensor<4096x1xi1, #mma> 2026-02-21T09:47:37.4647018Z %63 = tt.broadcast %62 : tensor<4096x1xi1, #mma> -> tensor<4096x16xi1, #mma> 2026-02-21T09:47:37.4647201Z %64 = arith.cmpi sge, %56, %cst_11 : tensor<1x16xi64, #mma> 2026-02-21T09:47:37.4647366Z %65 = arith.cmpi slt, %56, %cst_12 : tensor<1x16xi64, #mma> 2026-02-21T09:47:37.4647518Z %66 = arith.andi %64, %65 : tensor<1x16xi1, #mma> 2026-02-21T09:47:37.4647687Z %67 = tt.broadcast %66 : tensor<1x16xi1, #mma> -> tensor<4096x16xi1, #mma> 2026-02-21T09:47:37.4647859Z %68 = arith.andi %63, %67 : tensor<4096x16xi1, #mma> 2026-02-21T09:47:37.4648017Z tt.store %59, %43, %68 : tensor<4096x16x!tt.ptr, #mma> 2026-02-21T09:47:37.4648159Z tt.return 2026-02-21T09:47:37.4648241Z } 2026-02-21T09:47:37.4648326Z } 2026-02-21T09:47:37.4648371Z 2026-02-21T09:47:37.4648407Z {-# 2026-02-21T09:47:37.4648497Z external_resources: { 2026-02-21T09:47:37.4648598Z mlir_reproducer: { 2026-02-21T09:47:37.4649626Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, 
convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:47:37.4650640Z disable_threading: false, 2026-02-21T09:47:37.4650749Z verify_each: true 2026-02-21T09:47:37.4650844Z } 2026-02-21T09:47:37.4650918Z } 2026-02-21T09:47:37.4650993Z #-} 2026-02-21T09:47:37.4651277Z /tmp/torchinductor_root/vr/cvrwrnmhifgmlzk2dlcw2x5idbs7jh2w5orzxxh7f53mslhtvpcn.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:47:37.4651996Z /tmp/torchinductor_root/vr/cvrwrnmhifgmlzk2dlcw2x5idbs7jh2w5orzxxh7f53mslhtvpcn.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:47:37.4652565Z [188s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:47:37.4653291Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4096, 16], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:47:37.4653948Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:47:37.4654134Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:47:55.3321338Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:47:55.3323652Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:47:55.3324574Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:47:55.3325319Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:47:55.3325982Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:47:55.3326576Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:55.3327148Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:55.3327573Z #smem = #ttg.shared_memory 2026-02-21T09:47:55.3328117Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:47:55.3329236Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:47:55.3330177Z %cst = arith.constant dense<0.000000e+00> : tensor<1024x32xf32, #mma> 2026-02-21T09:47:55.3330564Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:47:55.3330838Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:47:55.3331113Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:47:55.3331377Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:47:55.3331644Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:47:55.3332156Z %cst_0 = arith.constant dense<0> : 
tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3332494Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:47:55.3332758Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:47:55.3333172Z %cst_1 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3333973Z %cst_2 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3334666Z %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3335024Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:47:55.3335225Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:47:55.3335470Z %cst_4 = arith.constant dense<1024> : tensor<1024x1xi32, #blocked1> 2026-02-21T09:47:55.3335828Z %cst_5 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3336240Z %cst_6 = arith.constant dense<4> : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3336590Z %cst_7 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3336862Z %cst_8 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3337141Z %cst_9 = arith.constant dense<8192> : tensor<1024x1xi32, #mma> 2026-02-21T09:47:55.3337372Z %0 = tt.get_program_id x : i32 2026-02-21T09:47:55.3337601Z %1 = arith.divsi %0, %c64_i32 : i32 2026-02-21T09:47:55.3337781Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:47:55.3337965Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T09:47:55.3338138Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:47:55.3338322Z %5 = arith.remsi %0, %c64_i32 : i32 2026-02-21T09:47:55.3338540Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:47:55.3338703Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:47:55.3338881Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:47:55.3339137Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:47:55.3339514Z %10 = 
tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3340037Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:55.3340498Z %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3340899Z %13 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:55.3341302Z %14 = arith.addi %12, %10 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3341699Z %15 = arith.addi %13, %11 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:55.3341961Z %16 = arith.muli %8, %c1024_i32 : i32 2026-02-21T09:47:55.3342285Z %17 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:55.3342739Z %18 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:55.3343134Z %19 = tt.splat %16 : i32 -> tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:55.3352114Z %20 = tt.splat %16 : i32 -> tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:55.3352424Z %21 = arith.addi %19, %17 : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:55.3352693Z %22 = arith.addi %20, %18 : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:55.3353038Z %23 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3353419Z %24 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3353866Z %25 = tt.expand_dims %21 {axis = 1 : i32} : 
tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<1024x1xi32, #blocked1> 2026-02-21T09:47:55.3354182Z %26 = arith.muli %25, %cst_4 : tensor<1024x1xi32, #blocked1> 2026-02-21T09:47:55.3354426Z %27 = tt.broadcast %26 : tensor<1024x1xi32, #blocked1> -> tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3354683Z %28 = tt.splat %arg0 : !tt.ptr -> tensor<1024x4x!tt.ptr, #blocked1> 2026-02-21T09:47:55.3355030Z %29 = tt.expand_dims %14 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3355457Z %30 = tt.broadcast %29 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3355785Z %31 = tt.splat %arg1 : !tt.ptr -> tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3356098Z %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:47:55.3356510Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:47:55.3356909Z %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3357166Z %35 = arith.cmpi eq, %34, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3357366Z %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked> 2026-02-21T09:47:55.3357561Z %37 = arith.cmpi eq, %34, %cst_8 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3357760Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked> 2026-02-21T09:47:55.3357994Z %39 = ttg.local_alloc : () -> !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> 
2026-02-21T09:47:55.3358267Z %40 = tt.expand_dims %24 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:47:55.3358554Z %41 = tt.broadcast %40 : tensor<1x4xi32, #blocked1> -> tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3358753Z %42 = arith.addi %27, %41 : tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3358960Z %43 = tt.addptr %28, %42 : tensor<1024x4x!tt.ptr, #blocked1>, tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3359169Z %44 = tt.load %43 : tensor<1024x4x!tt.ptr, #blocked1> 2026-02-21T09:47:55.3359463Z %45 = ttg.memdesc_index %39[%c0_i32] : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> 2026-02-21T09:47:55.3359831Z ttg.local_store %44, %45 : tensor<1024x4xbf16, #blocked1> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> 2026-02-21T09:47:55.3360119Z %46 = arith.addi %24, %cst_1 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3360399Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:47:55.3360673Z %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked1> -> tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3360870Z %49 = arith.addi %27, %48 : tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3361071Z %50 = tt.addptr %28, %49 : tensor<1024x4x!tt.ptr, #blocked1>, tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3361282Z %51 = tt.load %50 : tensor<1024x4x!tt.ptr, #blocked1> 2026-02-21T09:47:55.3361571Z %52 = ttg.memdesc_index %39[%c1_i32] : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> 2026-02-21T09:47:55.3361958Z ttg.local_store %51, %52 : tensor<1024x4xbf16, #blocked1> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> 2026-02-21T09:47:55.3362498Z %53:4 = scf.for %arg3 = %c0_i32 to %c508_i32 step 
%c2_i32 iter_args(%arg4 = %cst, %arg5 = %c1_i32, %arg6 = %45, %arg7 = %52) -> (tensor<1024x32xf32, #mma>, i32, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>) : i32 { 2026-02-21T09:47:55.3363099Z %109 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3363401Z %110 = arith.addi %109, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3363626Z %111 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:47:55.3363752Z %112 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:47:55.3363929Z %113 = tt.splat %112 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3364163Z %114 = arith.addi %113, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3364442Z %115 = tt.expand_dims %114 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:47:55.3364730Z %116 = tt.broadcast %115 : tensor<1x4xi32, #blocked1> -> tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3364930Z %117 = arith.addi %27, %116 : tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3365144Z %118 = tt.addptr %28, %117 : tensor<1024x4x!tt.ptr, #blocked1>, tensor<1024x4xi32, #blocked1> 2026-02-21T09:47:55.3365365Z %119 = tt.load %118 : tensor<1024x4x!tt.ptr, #blocked1> 2026-02-21T09:47:55.3365680Z %120 = ttg.local_load %arg6 : !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> -> tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3366156Z %121 = arith.extf %120 : tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3366620Z %122 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = 
#ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3367000Z %123 = arith.muli %122, %cst_5 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3367314Z %124 = tt.broadcast %123 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3367621Z %125 = arith.addi %124, %30 : tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3367936Z %126 = tt.addptr %31, %125 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3368251Z %127 = tt.load %126 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3368485Z %128 = arith.shli %127, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3368727Z %129 = arith.shrsi %128, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3368966Z %130 = arith.shrsi %127, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3369266Z %131 = tt.expand_dims %129 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:47:55.3369606Z %132 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:47:55.3369894Z %133 = tt.broadcast %131 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3370139Z %134 = arith.select %36, %133, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3370403Z %135 = tt.broadcast %132 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3370641Z %136 = arith.select %38, %135, %134 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 
2026-02-21T09:47:55.3370877Z %137 = tt.reshape %136 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:47:55.3371120Z %138 = arith.sitofp %137 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:47:55.3371375Z %139 = ttg.local_alloc %138 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:47:55.3371698Z %140 = ttg.local_load %139 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3372183Z %141 = tt.dot %121, %140, %arg4, inputPrecision = tf32 : tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<1024x32xf32, #mma> 2026-02-21T09:47:55.3372545Z %142 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:47:55.3372675Z %143 = arith.cmpi slt, %142, %c2_i32 : i32 2026-02-21T09:47:55.3372813Z %144 = arith.select %143, %142, %c0_i32 : i32 2026-02-21T09:47:55.3373088Z %145 = ttg.memdesc_index %39[%144] : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> 2026-02-21T09:47:55.3373462Z ttg.local_store %119, %145 : tensor<1024x4xbf16, #blocked1> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> 2026-02-21T09:47:55.3373880Z scf.yield %141, %144, %arg7, %145 : tensor<1024x32xf32, #mma>, i32, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> 2026-02-21T09:47:55.3374216Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:47:55.3374477Z %54 = arith.addi %23, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3374855Z %55 = ttg.local_load %53#2 : !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> -> tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:47:55.3375307Z %56 = arith.extf %55 : tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3375766Z %57 = tt.expand_dims %54 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3376117Z %58 = arith.muli %57, %cst_5 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3376417Z %59 = tt.broadcast %58 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3376718Z %60 = arith.addi %59, %30 : tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3377018Z %61 = tt.addptr %31, %60 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3377327Z %62 = tt.load %61 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3377556Z %63 = arith.shli %62, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3377786Z %64 = arith.shrsi %63, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3378019Z %65 = arith.shrsi %62, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3378299Z %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:47:55.3378634Z %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:47:55.3378929Z %68 = tt.broadcast %66 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3379162Z %69 = arith.select %36, %68, %cst_0 : tensor<2x2x32xi1, 
#blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3379392Z %70 = tt.broadcast %67 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3379638Z %71 = arith.select %38, %70, %69 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3379864Z %72 = tt.reshape %71 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:47:55.3380080Z %73 = arith.sitofp %72 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:47:55.3380321Z %74 = ttg.local_alloc %73 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:47:55.3380642Z %75 = ttg.local_load %74 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3381107Z %76 = tt.dot %56, %75, %53#0, inputPrecision = tf32 : tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<1024x32xf32, #mma> 2026-02-21T09:47:55.3381540Z %77 = arith.addi %23, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3381936Z %78 = ttg.local_load %53#3 : !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> -> tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3382375Z %79 = arith.extf %78 : tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3382851Z %80 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3383206Z %81 = arith.muli %80, %cst_5 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3383521Z %82 = tt.broadcast %81 : 
tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3383819Z %83 = arith.addi %82, %30 : tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3384122Z %84 = tt.addptr %31, %83 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3384423Z %85 = tt.load %84 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3384651Z %86 = arith.shli %85, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3384884Z %87 = arith.shrsi %86, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3385112Z %88 = arith.shrsi %85, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3385392Z %89 = tt.expand_dims %87 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:47:55.3385718Z %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:47:55.3385997Z %91 = tt.broadcast %89 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3386233Z %92 = arith.select %36, %91, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3386461Z %93 = tt.broadcast %90 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3386688Z %94 = arith.select %38, %93, %92 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:47:55.3386909Z %95 = tt.reshape %94 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:47:55.3387145Z %96 = arith.sitofp %95 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:47:55.3387392Z %97 = ttg.local_alloc %96 : (tensor<4x32xf32, #blocked2>) -> 
!ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:47:55.3387707Z %98 = ttg.local_load %97 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3388185Z %99 = tt.dot %79, %98, %76, inputPrecision = tf32 : tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<1024x32xf32, #mma> 2026-02-21T09:47:55.3388568Z ttg.local_dealloc %39 : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> 2026-02-21T09:47:55.3388788Z %100 = arith.truncf %99 : tensor<1024x32xf32, #mma> to tensor<1024x32xbf16, #mma> 2026-02-21T09:47:55.3389067Z %101 = tt.expand_dims %22 {axis = 1 : i32} : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<1024x1xi32, #mma> 2026-02-21T09:47:55.3389310Z %102 = arith.muli %101, %cst_9 : tensor<1024x1xi32, #mma> 2026-02-21T09:47:55.3389546Z %103 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T09:47:55.3389805Z %104 = tt.broadcast %102 : tensor<1024x1xi32, #mma> -> tensor<1024x32xi32, #mma> 2026-02-21T09:47:55.3390018Z %105 = tt.broadcast %103 : tensor<1x32xi32, #mma> -> tensor<1024x32xi32, #mma> 2026-02-21T09:47:55.3390205Z %106 = arith.addi %104, %105 : tensor<1024x32xi32, #mma> 2026-02-21T09:47:55.3390383Z %107 = tt.splat %arg2 : !tt.ptr -> tensor<1024x32x!tt.ptr, #mma> 2026-02-21T09:47:55.3390609Z %108 = tt.addptr %107, %106 : tensor<1024x32x!tt.ptr, #mma>, tensor<1024x32xi32, #mma> 2026-02-21T09:47:55.3390812Z tt.store %108, %100 : tensor<1024x32x!tt.ptr, #mma> 2026-02-21T09:47:55.3390956Z tt.return 2026-02-21T09:47:55.3391054Z } 2026-02-21T09:47:55.3391143Z } 2026-02-21T09:47:55.3391188Z 2026-02-21T09:47:55.3391227Z {-# 2026-02-21T09:47:55.3391310Z external_resources: { 2026-02-21T09:47:55.3391434Z mlir_reproducer: { 2026-02-21T09:47:55.3392434Z pipeline: 
"builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:47:55.3393426Z disable_threading: false, 2026-02-21T09:47:55.3393540Z verify_each: true 2026-02-21T09:47:55.3393637Z } 2026-02-21T09:47:55.3393715Z } 2026-02-21T09:47:55.3393788Z #-} 2026-02-21T09:47:55.3394071Z /tmp/torchinductor_root/3j/c3jhspwcswupy7forased75qyulwdzwqthgnq6ccgfjrupw4dxy6.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:47:55.3394761Z /tmp/torchinductor_root/3j/c3jhspwcswupy7forased75qyulwdzwqthgnq6ccgfjrupw4dxy6.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:47:55.3395312Z [206s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:47:55.3396068Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 1024, 32], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:47:55.3396727Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:47:55.3396899Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:47:55.3945148Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:47:55.3948826Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:47:55.3949168Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:47:55.3949495Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:47:55.3949797Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:47:55.3950065Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:47:55.3950306Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:47:55.3950491Z #smem = #ttg.shared_memory 2026-02-21T09:47:55.3950728Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:47:55.3951203Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr 
{tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:47:55.3951588Z %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #mma> 2026-02-21T09:47:55.3951756Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:47:55.3951917Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:47:55.3952042Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:47:55.3952157Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:47:55.3952301Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:47:55.3952425Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:47:55.3952550Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:47:55.3952706Z %cst_0 = arith.constant dense<0> : tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3952854Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:47:55.3953042Z %cst_1 = arith.constant dense<0> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3953300Z %cst_2 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3953558Z %cst_3 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3953816Z %cst_4 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3954069Z %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3954322Z %cst_6 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3954544Z %cst_7 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1> 2026-02-21T09:47:55.3954762Z %cst_8 = arith.constant dense<4> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3954978Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3955154Z %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 
2026-02-21T09:47:55.3955338Z %cst_11 = arith.constant dense<8192> : tensor<32x1xi32, #mma> 2026-02-21T09:47:55.3955485Z %0 = tt.get_program_id x : i32 2026-02-21T09:47:55.3955605Z %1 = arith.divsi %0, %c2048_i32 : i32 2026-02-21T09:47:55.3955726Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:47:55.3955877Z %3 = arith.subi %c512_i32, %2 : i32 2026-02-21T09:47:55.3956020Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:47:55.3956148Z %5 = arith.remsi %0, %c2048_i32 : i32 2026-02-21T09:47:55.3956271Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:47:55.3956386Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:47:55.3956519Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:47:55.3956631Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:47:55.3956841Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:55.3957117Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:55.3957366Z %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:55.3957580Z %13 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:55.3957800Z %14 = arith.addi %12, %10 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:47:55.3958016Z %15 = arith.addi %13, %11 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:47:55.3958184Z %16 = arith.muli %8, %c16_i32 : i32 2026-02-21T09:47:55.3958425Z %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3958735Z %18 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:55.3958975Z %19 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:55.3959187Z %20 = arith.addi %19, %18 : 
tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:47:55.3959428Z %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3959763Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> 2026-02-21T09:47:55.3960013Z %23 = arith.muli %22, %cst_7 : tensor<32x1xi32, #blocked1> 2026-02-21T09:47:55.3960233Z %24 = tt.broadcast %23 : tensor<32x1xi32, #blocked1> -> tensor<32x8xi32, #blocked1> 2026-02-21T09:47:55.3960451Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr, #blocked1> 2026-02-21T09:47:55.3960617Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T09:47:55.3960818Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3961125Z %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3961561Z %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3961967Z %30 = tt.splat %26 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3962367Z %31 = arith.extsi %17 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3962838Z %32 = arith.addi %30, %31 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3963229Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:47:55.3963651Z %34 = tt.broadcast %33 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3963969Z %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3964241Z %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3964473Z %37 = arith.andi %35, %36 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3964773Z %38 = tt.broadcast %37 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3965166Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:47:55.3965580Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:47:55.3965986Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3966239Z %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3966437Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked> 2026-02-21T09:47:55.3966636Z %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:47:55.3966830Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked> 2026-02-21T09:47:55.3967093Z %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg4 = %cst) -> (tensor<32x16xf32, #mma>) : i32 { 2026-02-21T09:47:55.3967307Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:47:55.3967480Z %57 = tt.splat %56 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T09:47:55.3967698Z %58 = arith.addi %57, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3967991Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:47:55.3968274Z %60 = tt.broadcast %59 : tensor<1x8xi32, #blocked1> -> tensor<32x8xi32, #blocked1> 2026-02-21T09:47:55.3968482Z %61 = arith.addi %24, %60 : tensor<32x8xi32, #blocked1> 2026-02-21T09:47:55.3968684Z %62 = tt.addptr %25, %61 : tensor<32x8x!tt.ptr, #blocked1>, tensor<32x8xi32, #blocked1> 2026-02-21T09:47:55.3968890Z %63 = tt.load %62 : tensor<32x8x!tt.ptr, #blocked1> 2026-02-21T09:47:55.3969112Z %64 = ttg.local_alloc %63 : (tensor<32x8xbf16, #blocked1>) -> !ttg.memdesc<32x8xbf16, #shared, #smem> 2026-02-21T09:47:55.3969447Z %65 = ttg.local_load %64 : !ttg.memdesc<32x8xbf16, #shared, #smem> -> tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3969854Z %66 = arith.extf %65 : tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3970142Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:47:55.3970351Z %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3970650Z %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3971037Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3971388Z %71 = arith.muli %70, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3971695Z %72 = tt.broadcast %71 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3971997Z %73 = arith.addi %72, %34 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3972325Z %74 = tt.addptr %27, %73 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3972642Z %75 = arith.cmpi sge, %70, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3972884Z %76 = arith.cmpi slt, %70, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3973132Z %77 = arith.andi %75, %76 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3973454Z %78 = tt.broadcast %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3973751Z %79 = arith.andi %78, %38 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3973998Z %80 = tt.load %74, %79, %cst_1 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3974248Z %81 = arith.shli %80, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3974479Z %82 = arith.shrsi %81, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3974719Z %83 = arith.shrsi %80, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3975002Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:55.3975337Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:55.3975621Z %86 = tt.broadcast %84 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3975854Z %87 = arith.select %43, %86, %cst_0 : 
tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3976087Z %88 = tt.broadcast %85 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3976330Z %89 = arith.select %45, %88, %87 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3976556Z %90 = tt.reshape %89 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2> 2026-02-21T09:47:55.3976791Z %91 = arith.sitofp %90 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2> 2026-02-21T09:47:55.3977035Z %92 = ttg.local_alloc %91 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem> 2026-02-21T09:47:55.3977359Z %93 = ttg.local_load %92 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3977833Z %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma> 2026-02-21T09:47:55.3978178Z %95 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:47:55.3978305Z %96 = arith.muli %95, %c2_i32 : i32 2026-02-21T09:47:55.3978474Z %97 = tt.splat %96 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3978701Z %98 = arith.addi %97, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:47:55.3978977Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:47:55.3979258Z %100 = tt.broadcast %99 : tensor<1x8xi32, #blocked1> -> tensor<32x8xi32, #blocked1> 2026-02-21T09:47:55.3979462Z %101 = arith.addi %24, %100 : tensor<32x8xi32, #blocked1> 2026-02-21T09:47:55.3979666Z %102 = tt.addptr %25, %101 : tensor<32x8x!tt.ptr, #blocked1>, tensor<32x8xi32, #blocked1> 2026-02-21T09:47:55.3979881Z %103 = tt.load %102 : tensor<32x8x!tt.ptr, #blocked1> 
2026-02-21T09:47:55.3980110Z %104 = ttg.local_alloc %103 : (tensor<32x8xbf16, #blocked1>) -> !ttg.memdesc<32x8xbf16, #shared, #smem> 2026-02-21T09:47:55.3980465Z %105 = ttg.local_load %104 : !ttg.memdesc<32x8xbf16, #shared, #smem> -> tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3980878Z %106 = arith.extf %105 : tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3981169Z %107 = arith.extsi %95 : i32 to i64 2026-02-21T09:47:55.3981386Z %108 = tt.splat %107 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3981691Z %109 = arith.addi %108, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:47:55.3982090Z %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3982456Z %111 = arith.muli %110, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3982764Z %112 = tt.broadcast %111 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3983079Z %113 = arith.addi %112, %34 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3983394Z %114 = tt.addptr %27, %113 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3983715Z %115 = arith.cmpi sge, %110, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3983961Z %116 = arith.cmpi slt, %110, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3984200Z %117 = arith.andi %115, %116 : 
tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3984520Z %118 = tt.broadcast %117 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3984823Z %119 = arith.andi %118, %38 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3985082Z %120 = tt.load %114, %119, %cst_1 : tensor<4x16x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3985338Z %121 = arith.shli %120, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3985580Z %122 = arith.shrsi %121, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3985819Z %123 = arith.shrsi %120, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:47:55.3986114Z %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:55.3986453Z %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked> 2026-02-21T09:47:55.3986745Z %126 = tt.broadcast %124 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3986992Z %127 = arith.select %43, %126, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3987230Z %128 = tt.broadcast %125 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3987468Z %129 = arith.select %45, %128, %127 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked> 2026-02-21T09:47:55.3987701Z %130 = tt.reshape %129 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2> 2026-02-21T09:47:55.3987927Z %131 = arith.sitofp %130 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2> 2026-02-21T09:47:55.3988188Z %132 = ttg.local_alloc %131 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, 
#smem> 2026-02-21T09:47:55.3988532Z %133 = ttg.local_load %132 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:47:55.3989002Z %134 = tt.dot %106, %133, %94, inputPrecision = tf32 : tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma> 2026-02-21T09:47:55.3989364Z scf.yield %134 : tensor<32x16xf32, #mma> 2026-02-21T09:47:55.3989496Z } {tt.num_stages = 1 : i32} 2026-02-21T09:47:55.3989653Z %47 = arith.truncf %46 : tensor<32x16xf32, #mma> to tensor<32x16xbf16, #mma> 2026-02-21T09:47:55.3989912Z %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma> 2026-02-21T09:47:55.3990155Z %49 = arith.muli %48, %cst_11 : tensor<32x1xi32, #mma> 2026-02-21T09:47:55.3990380Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma> 2026-02-21T09:47:55.3990637Z %51 = tt.broadcast %49 : tensor<32x1xi32, #mma> -> tensor<32x16xi32, #mma> 2026-02-21T09:47:55.3990841Z %52 = tt.broadcast %50 : tensor<1x16xi32, #mma> -> tensor<32x16xi32, #mma> 2026-02-21T09:47:55.3991016Z %53 = arith.addi %51, %52 : tensor<32x16xi32, #mma> 2026-02-21T09:47:55.3991193Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<32x16x!tt.ptr, #mma> 2026-02-21T09:47:55.3991402Z %55 = tt.addptr %54, %53 : tensor<32x16x!tt.ptr, #mma>, tensor<32x16xi32, #mma> 2026-02-21T09:47:55.3991594Z tt.store %55, %47 : tensor<32x16x!tt.ptr, #mma> 2026-02-21T09:47:55.3991725Z tt.return 2026-02-21T09:47:55.3991810Z } 2026-02-21T09:47:55.3991888Z } 2026-02-21T09:47:55.3991932Z 2026-02-21T09:47:55.3991963Z {-# 2026-02-21T09:47:55.3992050Z external_resources: { 2026-02-21T09:47:55.3992149Z mlir_reproducer: { 2026-02-21T09:47:55.3993182Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, 
convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:47:55.3994201Z disable_threading: false, 2026-02-21T09:47:55.3994308Z verify_each: true 2026-02-21T09:47:55.3994403Z } 2026-02-21T09:47:55.3994477Z } 2026-02-21T09:47:55.3994553Z #-} 2026-02-21T09:47:55.3994838Z /tmp/torchinductor_root/lm/clmld3y3zmlww2g53qdkhjga4fhrgryynl2vnxsuowkkhbf4naug.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:47:55.3995549Z /tmp/torchinductor_root/lm/clmld3y3zmlww2g53qdkhjga4fhrgryynl2vnxsuowkkhbf4naug.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:47:55.3996111Z [206s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:47:55.3996843Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 32, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:47:55.3997504Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:47:55.3997679Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:48:03.6817391Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:48:03.6818930Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:48:03.6819980Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:48:03.6820820Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:48:03.6821591Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:48:03.6822290Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:03.6822932Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:03.6823437Z #smem = #ttg.shared_memory 2026-02-21T09:48:03.6823712Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:48:03.6824188Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:48:03.6824610Z %cst = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:48:03.6824781Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:48:03.6824902Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:48:03.6825026Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:48:03.6825144Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:48:03.6825298Z %cst_0 = arith.constant dense<0> : tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6825452Z %c64_i32 = 
arith.constant 64 : i32 2026-02-21T09:48:03.6825663Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:48:03.6825784Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:48:03.6825904Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:48:03.6826163Z %cst_1 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:03.6826382Z %cst_2 = arith.constant dense<4> : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6826608Z %cst_3 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:03.6826783Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:03.6826988Z %cst_5 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:48:03.6827161Z %cst_6 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:48:03.6827328Z %cst_7 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:48:03.6827504Z %cst_8 = arith.constant dense<0> : tensor<1x128xi64, #mma> 2026-02-21T09:48:03.6827680Z %cst_9 = arith.constant dense<8192> : tensor<1x128xi64, #mma> 2026-02-21T09:48:03.6827830Z %0 = tt.get_program_id x : i32 2026-02-21T09:48:03.6827949Z %1 = arith.divsi %0, %c256_i32 : i32 2026-02-21T09:48:03.6828070Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:48:03.6828191Z %3 = arith.subi %c64_i32, %2 : i32 2026-02-21T09:48:03.6828308Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:48:03.6828426Z %5 = arith.remsi %0, %c256_i32 : i32 2026-02-21T09:48:03.6828541Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:48:03.6828656Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:48:03.6828769Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:48:03.6828879Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:48:03.6829089Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:03.6829374Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:48:03.6829723Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:03.6830041Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:03.6830344Z %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:03.6830646Z %15 = arith.addi %14, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:03.6830857Z %16 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:48:03.6831031Z %17 = tt.splat %16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:03.6831259Z %18 = arith.addi %17, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:03.6831507Z %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:03.6831820Z %20 = tt.expand_dims %18 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:48:03.6832074Z %21 = arith.muli %20, %cst_1 : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:03.6832316Z %22 = tt.broadcast %21 : tensor<128x1xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:48:03.6832541Z %23 = tt.splat %arg0 : !tt.ptr -> tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:48:03.6832886Z %24 = tt.expand_dims %15 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6833264Z %25 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6833596Z %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:48:03.6834005Z %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:48:03.6834422Z %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:03.6834676Z %29 = arith.cmpi eq, %28, %cst_3 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:03.6834873Z %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T09:48:03.6835073Z %31 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:03.6835262Z %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T09:48:03.6835529Z %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:48:03.6835749Z %60 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:48:03.6835917Z %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:03.6836136Z %62 = arith.addi %61, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:03.6836405Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:48:03.6836676Z %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:48:03.6836869Z %65 = arith.addi %22, %64 : tensor<128x2xi32, #blocked1> 2026-02-21T09:48:03.6837063Z %66 = tt.addptr %23, %65 : tensor<128x2x!tt.ptr, #blocked1>, tensor<128x2xi32, #blocked1> 2026-02-21T09:48:03.6837265Z %67 = tt.load %66 : tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:48:03.6837484Z %68 = ttg.local_alloc %67 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem> 
2026-02-21T09:48:03.6837834Z %69 = ttg.local_load %68 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:03.6838242Z %70 = arith.extf %69 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:03.6838541Z %71 = arith.muli %arg3, %c8192_i32 : i32 2026-02-21T09:48:03.6838719Z %72 = tt.splat %71 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6838942Z %73 = arith.addi %72, %24 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6839248Z %74 = tt.addptr %25, %73 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6839556Z %75 = tt.load %74 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6839780Z %76 = arith.shli %75, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6840010Z %77 = arith.shrsi %76, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6840239Z %78 = arith.shrsi %75, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6840525Z %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:48:03.6840859Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:48:03.6841138Z %81 = tt.broadcast %79 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6841377Z %82 = arith.select %30, %81, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6841626Z %83 = tt.broadcast %80 : tensor<1x1x128xi8, #blocked> -> 
tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6841853Z %84 = arith.select %32, %83, %82 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6842098Z %85 = tt.reshape %84 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2> 2026-02-21T09:48:03.6842319Z %86 = arith.sitofp %85 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2> 2026-02-21T09:48:03.6842647Z %87 = ttg.local_alloc %86 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared1, #smem> 2026-02-21T09:48:03.6842974Z %88 = ttg.local_load %87 : !ttg.memdesc<2x128xf32, #shared1, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:03.6843447Z %89 = tt.dot %70, %88, %arg4, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:48:03.6843798Z %90 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:48:03.6843919Z %91 = arith.muli %90, %c2_i32 : i32 2026-02-21T09:48:03.6844091Z %92 = tt.splat %91 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:03.6844302Z %93 = arith.addi %92, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:03.6844577Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1> 2026-02-21T09:48:03.6844848Z %95 = tt.broadcast %94 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T09:48:03.6845042Z %96 = arith.addi %22, %95 : tensor<128x2xi32, #blocked1> 2026-02-21T09:48:03.6845244Z %97 = tt.addptr %23, %96 : tensor<128x2x!tt.ptr, #blocked1>, tensor<128x2xi32, #blocked1> 2026-02-21T09:48:03.6845445Z %98 = tt.load %97 : tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T09:48:03.6845684Z %99 = ttg.local_alloc %98 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem> 
2026-02-21T09:48:03.6846011Z %100 = ttg.local_load %99 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:03.6846417Z %101 = arith.extf %100 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:03.6846726Z %102 = arith.muli %90, %c8192_i32 : i32 2026-02-21T09:48:03.6846901Z %103 = tt.splat %102 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6847134Z %104 = arith.addi %103, %24 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6847449Z %105 = tt.addptr %25, %104 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6847765Z %106 = tt.load %105 : tensor<1x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6848002Z %107 = arith.shli %106, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6848240Z %108 = arith.shrsi %107, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6848480Z %109 = arith.shrsi %106, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:03.6848772Z %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:48:03.6849110Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked> 2026-02-21T09:48:03.6849405Z %112 = tt.broadcast %110 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6849698Z %113 = arith.select %30, %112, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6849935Z %114 = tt.broadcast %111 : tensor<1x1x128xi8, 
#blocked> -> tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6850191Z %115 = arith.select %32, %114, %113 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked> 2026-02-21T09:48:03.6850421Z %116 = tt.reshape %115 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2> 2026-02-21T09:48:03.6850647Z %117 = arith.sitofp %116 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2> 2026-02-21T09:48:03.6850903Z %118 = ttg.local_alloc %117 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared1, #smem> 2026-02-21T09:48:03.6851229Z %119 = ttg.local_load %118 : !ttg.memdesc<2x128xf32, #shared1, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:03.6851702Z %120 = tt.dot %101, %119, %89, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:48:03.6852051Z scf.yield %120 : tensor<128x128xf32, #mma> 2026-02-21T09:48:03.6852175Z } {tt.flatten} 2026-02-21T09:48:03.6852316Z %34 = arith.truncf %33 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:48:03.6852487Z %35 = arith.extsi %16 : i32 to i64 2026-02-21T09:48:03.6852604Z %36 = arith.extsi %9 : i32 to i64 2026-02-21T09:48:03.6852758Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:48:03.6852962Z %38 = tt.splat %35 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:03.6853238Z %39 = arith.extsi %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:03.6853574Z %40 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:03.6853863Z %41 = arith.addi %38, %39 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:03.6854120Z %42 = 
tt.expand_dims %41 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:03.6854376Z %43 = arith.muli %42, %cst_5 : tensor<128x1xi64, #mma> 2026-02-21T09:48:03.6854548Z %44 = tt.broadcast %43 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:48:03.6854750Z %45 = tt.splat %36 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:03.6854954Z %46 = arith.addi %45, %40 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:03.6855209Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:48:03.6855466Z %48 = tt.broadcast %47 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:48:03.6855642Z %49 = arith.addi %44, %48 : tensor<128x128xi64, #mma> 2026-02-21T09:48:03.6855830Z %50 = tt.addptr %37, %49 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:48:03.6856028Z %51 = arith.cmpi sge, %42, %cst_6 : tensor<128x1xi64, #mma> 2026-02-21T09:48:03.6856187Z %52 = arith.cmpi slt, %42, %cst_7 : tensor<128x1xi64, #mma> 2026-02-21T09:48:03.6856343Z %53 = arith.andi %51, %52 : tensor<128x1xi1, #mma> 2026-02-21T09:48:03.6856510Z %54 = tt.broadcast %53 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:48:03.6856693Z %55 = arith.cmpi sge, %47, %cst_8 : tensor<1x128xi64, #mma> 2026-02-21T09:48:03.6856849Z %56 = arith.cmpi slt, %47, %cst_9 : tensor<1x128xi64, #mma> 2026-02-21T09:48:03.6857000Z %57 = arith.andi %55, %56 : tensor<1x128xi1, #mma> 2026-02-21T09:48:03.6857167Z %58 = tt.broadcast %57 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:48:03.6857355Z %59 = arith.andi %54, %58 : tensor<128x128xi1, #mma> 2026-02-21T09:48:03.6857511Z tt.store %50, %34, %59 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:48:03.6857644Z tt.return 2026-02-21T09:48:03.6857744Z } 2026-02-21T09:48:03.6857823Z } 
2026-02-21T09:48:03.6857869Z 2026-02-21T09:48:03.6857900Z {-# 2026-02-21T09:48:03.6857981Z external_resources: { 2026-02-21T09:48:03.6858084Z mlir_reproducer: { 2026-02-21T09:48:03.6859080Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:48:03.6860069Z disable_threading: false, 2026-02-21T09:48:03.6860173Z verify_each: true 2026-02-21T09:48:03.6860263Z } 2026-02-21T09:48:03.6860339Z } 2026-02-21T09:48:03.6860410Z #-} 2026-02-21T09:48:03.6860686Z /tmp/torchinductor_root/o3/co3bzd4cuo6myrozjopy26yiptt777ypmetejcg2rol46jmdbobw.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:48:03.6861376Z /tmp/torchinductor_root/o3/co3bzd4cuo6myrozjopy26yiptt777ypmetejcg2rol46jmdbobw.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:48:03.6861939Z [214s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:48:03.6862680Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T09:48:03.6863333Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:48:03.6863519Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:48:13.4421889Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:48:13.4424191Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:48:13.4425213Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:48:13.4426133Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:48:13.4426958Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:48:13.4427754Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:48:13.4428481Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:13.4428985Z #smem = #ttg.shared_memory 2026-02-21T09:48:13.4429627Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:48:13.4430936Z tt.func public 
@_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:48:13.4432593Z %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma> 2026-02-21T09:48:13.4433088Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:13.4433634Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:13.4433924Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x32xf32, #mma> 2026-02-21T09:48:13.4434155Z %cst_3 = arith.constant dense<8192> : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4434359Z %cst_4 = arith.constant dense<0> : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4434565Z %cst_5 = arith.constant dense<512> : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4434763Z %cst_6 = arith.constant dense<0> : tensor<1x32xi64, #blocked1> 2026-02-21T09:48:13.4434967Z %cst_7 = arith.constant dense<8192> : tensor<1x32xi64, #blocked1> 2026-02-21T09:48:13.4435170Z %cst_8 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2> 2026-02-21T09:48:13.4435351Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:48:13.4435490Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:48:13.4435622Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:48:13.4435760Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:48:13.4435902Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:48:13.4436038Z %c12_i32 = arith.constant 12 : i32 2026-02-21T09:48:13.4436203Z %cst_9 = arith.constant dense<0> : tensor<4x32xi8, #blocked1> 2026-02-21T09:48:13.4436373Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:48:13.4436540Z %cst_10 = arith.constant dense<0> : tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4436707Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:48:13.4436844Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:48:13.4436975Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:48:13.4437190Z %cst_11 = 
arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4437415Z %0 = tt.get_program_id x : i32 2026-02-21T09:48:13.4437549Z %1 = arith.divsi %0, %c512_i32 : i32 2026-02-21T09:48:13.4437779Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:48:13.4437913Z %3 = arith.subi %c1024_i32, %2 : i32 2026-02-21T09:48:13.4438076Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:48:13.4438217Z %5 = arith.remsi %0, %c512_i32 : i32 2026-02-21T09:48:13.4438470Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:48:13.4438601Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:48:13.4438733Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:48:13.4438865Z %9 = arith.muli %7, %c16_i32 : i32 2026-02-21T09:48:13.4439106Z %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:13.4439439Z %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:13.4439733Z %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:13.4439993Z %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:13.4440248Z %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:13.4440501Z %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:13.4440700Z %16 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:48:13.4440938Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:13.4441261Z %18 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:13.4441541Z %19 = tt.splat %16 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:13.4441777Z %20 = arith.addi %19, %18 : tensor<32xi32, #ttg.slice<{dim = 0, parent = 
#mma}>> 2026-02-21T09:48:13.4442061Z %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4442444Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2> 2026-02-21T09:48:13.4442843Z %23 = arith.muli %22, %cst_8 : tensor<16x1xi32, #blocked2> 2026-02-21T09:48:13.4443069Z %24 = tt.broadcast %23 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4443326Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:48:13.4443503Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T09:48:13.4443662Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<4x32x!tt.ptr, #blocked1> 2026-02-21T09:48:13.4443905Z %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4444227Z %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4444527Z %30 = tt.splat %26 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:13.4444830Z %31 = arith.extsi %17 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:13.4445126Z %32 = arith.addi %30, %31 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:13.4445407Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x32xi64, #blocked1> 2026-02-21T09:48:13.4445682Z %34 = tt.broadcast %33 : tensor<1x32xi64, #blocked1> -> tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4445885Z %35 = arith.cmpi sge, %33, %cst_6 : tensor<1x32xi64, #blocked1> 2026-02-21T09:48:13.4446059Z %36 = arith.cmpi slt, %33, %cst_7 : tensor<1x32xi64, #blocked1> 2026-02-21T09:48:13.4446225Z %37 = arith.andi %35, %36 : 
tensor<1x32xi1, #blocked1> 2026-02-21T09:48:13.4446411Z %38 = tt.broadcast %37 : tensor<1x32xi1, #blocked1> -> tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4446718Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:48:13.4447136Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:48:13.4447560Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:13.4447810Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:13.4448010Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:48:13.4448203Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:13.4448395Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:48:13.4448670Z %46 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c12_i32 iter_args(%arg4 = %cst_2) -> (tensor<16x32xf32, #mma>) : i32 { 2026-02-21T09:48:13.4448888Z %57 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:48:13.4449064Z %58 = tt.splat %57 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4449289Z %59 = arith.addi %58, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4449568Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:48:13.4449847Z %61 = tt.broadcast %60 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4450038Z %62 = arith.addi %24, %61 : tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4450239Z %63 = tt.addptr %25, %62 : tensor<16x8x!tt.ptr, #blocked2>, 
tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4450460Z %64 = tt.load %63 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:48:13.4450733Z %65 = ttg.convert_layout %64 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4451154Z %66 = arith.extf %65 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4451439Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:48:13.4451614Z %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4451832Z %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4452110Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4452365Z %71 = arith.muli %70, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4452557Z %72 = tt.broadcast %71 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4452751Z %73 = arith.addi %72, %34 : tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4452947Z %74 = tt.addptr %27, %73 : tensor<4x32x!tt.ptr, #blocked1>, tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4453157Z %75 = arith.cmpi sge, %70, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4453333Z %76 = arith.cmpi slt, %70, %cst_5 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4453494Z %77 = arith.andi %75, %76 : tensor<4x1xi1, #blocked1> 2026-02-21T09:48:13.4453680Z %78 = tt.broadcast %77 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4453867Z %79 = arith.andi %78, %38 : tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4454036Z %80 = tt.load %74, %79, %cst_9 : tensor<4x32x!tt.ptr, #blocked1> 2026-02-21T09:48:13.4454292Z %81 = ttg.convert_layout %80 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim 
= 1, parent = #blocked}>> 2026-02-21T09:48:13.4454639Z %82 = arith.shli %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4454877Z %83 = arith.shrsi %82, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4455116Z %84 = arith.shrsi %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4455427Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4455757Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4456039Z %87 = tt.broadcast %85 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4456279Z %88 = arith.select %43, %87, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4456515Z %89 = tt.broadcast %86 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4456744Z %90 = arith.select %45, %89, %88 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4456965Z %91 = tt.reshape %90 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3> 2026-02-21T09:48:13.4457186Z %92 = arith.sitofp %91 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3> 2026-02-21T09:48:13.4457434Z %93 = ttg.local_alloc %92 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem> 2026-02-21T09:48:13.4457750Z %94 = ttg.local_load %93 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4458229Z %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:48:13.4458595Z %96 = arith.addi 
%arg3, %c4_i32 : i32 2026-02-21T09:48:13.4458722Z %97 = arith.muli %96, %c2_i32 : i32 2026-02-21T09:48:13.4458897Z %98 = tt.splat %97 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4459130Z %99 = arith.addi %98, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4459411Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:48:13.4459699Z %101 = tt.broadcast %100 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4459893Z %102 = arith.addi %24, %101 : tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4460101Z %103 = tt.addptr %25, %102 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4460308Z %104 = tt.load %103 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:48:13.4460579Z %105 = ttg.convert_layout %104 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4460991Z %106 = arith.extf %105 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4461275Z %107 = arith.extsi %96 : i32 to i64 2026-02-21T09:48:13.4461452Z %108 = tt.splat %107 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4461676Z %109 = arith.addi %108, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4461958Z %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4462217Z %111 = arith.muli %110, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4462410Z %112 = tt.broadcast %111 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4462627Z %113 = arith.addi %112, %34 : tensor<4x32xi64, #blocked1> 
2026-02-21T09:48:13.4462825Z %114 = tt.addptr %27, %113 : tensor<4x32x!tt.ptr, #blocked1>, tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4463046Z %115 = arith.cmpi sge, %110, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4463237Z %116 = arith.cmpi slt, %110, %cst_5 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4463411Z %117 = arith.andi %115, %116 : tensor<4x1xi1, #blocked1> 2026-02-21T09:48:13.4463603Z %118 = tt.broadcast %117 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4463794Z %119 = arith.andi %118, %38 : tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4463968Z %120 = tt.load %114, %119, %cst_9 : tensor<4x32x!tt.ptr, #blocked1> 2026-02-21T09:48:13.4464230Z %121 = ttg.convert_layout %120 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4464521Z %122 = arith.shli %121, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4464766Z %123 = arith.shrsi %122, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4465006Z %124 = arith.shrsi %121, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4465301Z %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4465641Z %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4465928Z %127 = tt.broadcast %125 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4466175Z %128 = arith.select %43, %127, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4466428Z %129 = tt.broadcast %126 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4466670Z %130 = arith.select %45, %129, %128 : tensor<4x2x32xi1, #blocked>, 
tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4466918Z %131 = tt.reshape %130 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3> 2026-02-21T09:48:13.4467144Z %132 = arith.sitofp %131 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3> 2026-02-21T09:48:13.4467400Z %133 = ttg.local_alloc %132 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem> 2026-02-21T09:48:13.4467725Z %134 = ttg.local_load %133 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4468196Z %135 = tt.dot %106, %134, %95, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:48:13.4468543Z %136 = arith.addi %arg3, %c8_i32 : i32 2026-02-21T09:48:13.4468674Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:48:13.4468851Z %138 = tt.splat %137 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4469077Z %139 = arith.addi %138, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4469361Z %140 = tt.expand_dims %139 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:48:13.4469640Z %141 = tt.broadcast %140 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4469841Z %142 = arith.addi %24, %141 : tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4470049Z %143 = tt.addptr %25, %142 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4470261Z %144 = tt.load %143 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:48:13.4470549Z %145 = ttg.convert_layout %144 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4470952Z %146 = arith.extf %145 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 
0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4471240Z %147 = arith.extsi %136 : i32 to i64 2026-02-21T09:48:13.4471572Z %148 = tt.splat %147 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4471799Z %149 = arith.addi %148, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4472085Z %150 = tt.expand_dims %149 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4472373Z %151 = arith.muli %150, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4472565Z %152 = tt.broadcast %151 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4472766Z %153 = arith.addi %152, %34 : tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4472966Z %154 = tt.addptr %27, %153 : tensor<4x32x!tt.ptr, #blocked1>, tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4473184Z %155 = arith.cmpi sge, %150, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4473362Z %156 = arith.cmpi slt, %150, %cst_5 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4473530Z %157 = arith.andi %155, %156 : tensor<4x1xi1, #blocked1> 2026-02-21T09:48:13.4473724Z %158 = tt.broadcast %157 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4473916Z %159 = arith.andi %158, %38 : tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4474091Z %160 = tt.load %154, %159, %cst_9 : tensor<4x32x!tt.ptr, #blocked1> 2026-02-21T09:48:13.4474354Z %161 = ttg.convert_layout %160 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4474654Z %162 = arith.shli %161, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4474899Z %163 = arith.shrsi %162, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4475155Z %164 = arith.shrsi %161, %cst_11 : 
tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4475451Z %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4475793Z %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4476078Z %167 = tt.broadcast %165 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4476322Z %168 = arith.select %43, %167, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4476561Z %169 = tt.broadcast %166 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4476806Z %170 = arith.select %45, %169, %168 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4477041Z %171 = tt.reshape %170 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3> 2026-02-21T09:48:13.4477267Z %172 = arith.sitofp %171 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3> 2026-02-21T09:48:13.4477526Z %173 = ttg.local_alloc %172 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem> 2026-02-21T09:48:13.4477853Z %174 = ttg.local_load %173 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4478325Z %175 = tt.dot %146, %174, %135, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:48:13.4478681Z scf.yield %175 : tensor<16x32xf32, #mma> 2026-02-21T09:48:13.4478833Z } {tt.num_stages = 1 : i32} 2026-02-21T09:48:13.4479041Z %47 = scf.for %arg3 = %c504_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %46) -> (tensor<16x32xf32, #mma>) : i32 { 2026-02-21T09:48:13.4479258Z %57 = arith.muli %arg3, %c2_i32 : i32 
2026-02-21T09:48:13.4479434Z %58 = tt.splat %57 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4479670Z %59 = arith.addi %58, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:13.4479944Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:48:13.4480221Z %61 = tt.broadcast %60 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4480415Z %62 = arith.addi %24, %61 : tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4480616Z %63 = tt.addptr %25, %62 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T09:48:13.4480826Z %64 = tt.load %63 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T09:48:13.4481096Z %65 = ttg.convert_layout %64 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4481503Z %66 = arith.extf %65 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4481784Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:48:13.4481958Z %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4482183Z %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:13.4482464Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4482791Z %71 = arith.muli %70, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4482981Z %72 = tt.broadcast %71 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4483176Z %73 = arith.addi %72, %34 : tensor<4x32xi64, #blocked1> 2026-02-21T09:48:13.4483403Z %74 = tt.addptr %27, %73 : tensor<4x32x!tt.ptr, #blocked1>, tensor<4x32xi64, #blocked1> 
2026-02-21T09:48:13.4483612Z %75 = arith.cmpi sge, %70, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4483786Z %76 = arith.cmpi slt, %70, %cst_5 : tensor<4x1xi64, #blocked1> 2026-02-21T09:48:13.4483948Z %77 = arith.andi %75, %76 : tensor<4x1xi1, #blocked1> 2026-02-21T09:48:13.4484137Z %78 = tt.broadcast %77 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4484327Z %79 = arith.andi %78, %38 : tensor<4x32xi1, #blocked1> 2026-02-21T09:48:13.4484495Z %80 = tt.load %74, %79, %cst_9 : tensor<4x32x!tt.ptr, #blocked1> 2026-02-21T09:48:13.4484755Z %81 = ttg.convert_layout %80 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4485031Z %82 = arith.shli %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4485275Z %83 = arith.shrsi %82, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4485510Z %84 = arith.shrsi %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:13.4485799Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4486137Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:13.4486413Z %87 = tt.broadcast %85 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4486651Z %88 = arith.select %43, %87, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4486904Z %89 = tt.broadcast %86 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4487136Z %90 = arith.select %45, %89, %88 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:13.4487368Z %91 = tt.reshape %90 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3> 2026-02-21T09:48:13.4487600Z 
%92 = arith.sitofp %91 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3> 2026-02-21T09:48:13.4495447Z %93 = ttg.local_alloc %92 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem> 2026-02-21T09:48:13.4495804Z %94 = ttg.local_load %93 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:13.4496277Z %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma> 2026-02-21T09:48:13.4496635Z scf.yield %95 : tensor<16x32xf32, #mma> 2026-02-21T09:48:13.4496768Z } {tt.num_stages = 1 : i32} 2026-02-21T09:48:13.4496926Z %48 = arith.truncf %47 : tensor<16x32xf32, #mma> to tensor<16x32xbf16, #mma> 2026-02-21T09:48:13.4497189Z %49 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma> 2026-02-21T09:48:13.4497427Z %50 = arith.muli %49, %cst : tensor<16x1xi32, #mma> 2026-02-21T09:48:13.4497657Z %51 = tt.expand_dims %20 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma> 2026-02-21T09:48:13.4497912Z %52 = tt.broadcast %50 : tensor<16x1xi32, #mma> -> tensor<16x32xi32, #mma> 2026-02-21T09:48:13.4498108Z %53 = tt.broadcast %51 : tensor<1x32xi32, #mma> -> tensor<16x32xi32, #mma> 2026-02-21T09:48:13.4498286Z %54 = arith.addi %52, %53 : tensor<16x32xi32, #mma> 2026-02-21T09:48:13.4498502Z %55 = tt.splat %arg2 : !tt.ptr -> tensor<16x32x!tt.ptr, #mma> 2026-02-21T09:48:13.4498717Z %56 = tt.addptr %55, %54 : tensor<16x32x!tt.ptr, #mma>, tensor<16x32xi32, #mma> 2026-02-21T09:48:13.4498925Z tt.store %56, %48 : tensor<16x32x!tt.ptr, #mma> 2026-02-21T09:48:13.4499061Z tt.return 2026-02-21T09:48:13.4499152Z } 2026-02-21T09:48:13.4499236Z } 2026-02-21T09:48:13.4499282Z 2026-02-21T09:48:13.4499321Z {-# 
2026-02-21T09:48:13.4499406Z external_resources: { 2026-02-21T09:48:13.4499512Z mlir_reproducer: { 2026-02-21T09:48:13.4500516Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:48:13.4501523Z disable_threading: false, 2026-02-21T09:48:13.4501633Z verify_each: true 2026-02-21T09:48:13.4501730Z } 2026-02-21T09:48:13.4501809Z } 2026-02-21T09:48:13.4501887Z #-} 2026-02-21T09:48:13.4502167Z /tmp/torchinductor_root/nb/cnboqpelzdadahownptqgzjviokjf56pgn6uoo537sfljnhv5aii.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:48:13.4502860Z /tmp/torchinductor_root/nb/cnboqpelzdadahownptqgzjviokjf56pgn6uoo537sfljnhv5aii.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:48:13.4503419Z [224s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:48:13.4504159Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:48:13.4509496Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:48:13.4509672Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:48:13.4510188Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.9 configs/s 2026-02-21T09:48:13.4510433Z [224s] Adaptive compile timeout: 30s (90% percentile=15.1s, bounds=[30.0s, 30s]) 2026-02-21T09:48:13.4510635Z [224s] Initial random population of 100, 5 starting points: 2026-02-21T09:48:13.4510778Z error=15 2026-02-21T09:48:13.4510866Z timeout=3 2026-02-21T09:48:13.4510950Z ok=82 2026-02-21T09:48:13.4511038Z min=1.5749 2026-02-21T09:48:13.4511119Z mid=115.6728 2026-02-21T09:48:13.4511205Z max=3930.6316 2026-02-21T09:48:13.4511304Z best={'block_sizes': [8, 128, 512], 2026-02-21T09:48:13.4511445Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T09:48:13.4511577Z 'l2_groupings': [1], 2026-02-21T09:48:13.4511690Z 'load_eviction_policies': ['', ''], 2026-02-21T09:48:13.4511807Z 'loop_orders': [[0, 1]], 2026-02-21T09:48:13.4511923Z 'matrix_instr_nonkdim': 32, 2026-02-21T09:48:13.4512084Z 'num_sm_multiplier': 128, 2026-02-21T09:48:13.4512187Z 'num_stages': 4, 2026-02-21T09:48:13.4512281Z 'num_warps': 16, 2026-02-21T09:48:13.4512379Z 'pid_type': 'persistent_blocked', 2026-02-21T09:48:13.4512501Z 'range_flattens': [False, True], 2026-02-21T09:48:13.4512616Z 'range_multi_buffers': [True, True], 2026-02-21T09:48:13.4512738Z 'range_num_stages': [1, 1], 2026-02-21T09:48:13.4512845Z 
'range_unroll_factors': [3, 3], 2026-02-21T09:48:13.4512993Z 'range_warp_specializes': [], 2026-02-21T09:48:13.4513102Z 'waves_per_eu': 1} 2026-02-21T09:48:13.4513215Z [224s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:48:14.6224240Z [225s] Generation 1 starting: 108 neighbors, 5 active search path(s) 2026-02-21T09:48:40.0404193Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 111/111 0.7 configs/s 2026-02-21T09:48:41.3123092Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:48:41.3136108Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:48:41.3139503Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T09:48:41.3140234Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:48:41.3140781Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T09:48:41.3141303Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:48:41.3141769Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:41.3142181Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:41.3142501Z #smem = #ttg.shared_memory 2026-02-21T09:48:41.3142911Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:48:41.3143754Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, 
%arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:48:41.3144768Z %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3145072Z %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3145377Z %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3145671Z %cst_2 = arith.constant dense<8192> : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3146054Z %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3146353Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3146644Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3146940Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3147252Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:48:41.3147533Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:48:41.3147807Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3148084Z %cst_10 = arith.constant dense<0> : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3148353Z %cst_11 = arith.constant dense<8192> : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3148622Z %cst_12 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3148936Z %cst_13 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3149301Z %cst_14 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3149576Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:48:41.3149785Z %cst_15 = arith.constant dense<0> : tensor<2x512xi8, #blocked> 2026-02-21T09:48:41.3149990Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:48:41.3150157Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:48:41.3150314Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:48:41.3150485Z 
%c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:48:41.3150720Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:48:41.3150886Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:48:41.3151056Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:48:41.3151342Z %cst_16 = arith.constant dense<0> : tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3151553Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:48:41.3151715Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:48:41.3151970Z %cst_17 = arith.constant dense<4> : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3152243Z %0 = tt.get_program_id x : i32 2026-02-21T09:48:41.3152405Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:48:41.3152565Z %2 = arith.minsi %1, %c2048_i32 : i32 2026-02-21T09:48:41.3152859Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3153260Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3153635Z %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3153985Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3154261Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3154594Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3155035Z %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3155485Z %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3155947Z %11 = arith.extsi %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 
2026-02-21T09:48:41.3156497Z %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:48:41.3157084Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:48:41.3157635Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:48:41.3157938Z %15 = arith.cmpi eq, %14, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:48:41.3158167Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1> 2026-02-21T09:48:41.3158416Z %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:48:41.3158651Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1> 2026-02-21T09:48:41.3158898Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:48:41.3159189Z %20 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3159524Z %21 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3159853Z %22 = arith.extsi %21 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3160102Z %23 = arith.subi %2, %0 : i32 2026-02-21T09:48:41.3160228Z %24 = arith.remsi %23, %c3_i32 : i32 2026-02-21T09:48:41.3160351Z %25 = arith.subi %23, %24 : i32 2026-02-21T09:48:41.3160470Z %26 = arith.addi %0, %25 : i32 2026-02-21T09:48:41.3160653Z %27 = arith.addi %5, %cst_13 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3160970Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, 
#ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3161268Z %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3161531Z %30 = arith.addi %9, %cst_14 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3161848Z %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3162107Z %32 = arith.muli %31, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3162309Z %33 = tt.broadcast %32 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3162518Z %34 = arith.cmpi sge, %31, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3162770Z %35 = arith.cmpi slt, %31, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3162944Z %36 = arith.andi %34, %35 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3163134Z %37 = tt.broadcast %36 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3163331Z scf.for %arg3 = %0 to %26 step %c3_i32 : i32 { 2026-02-21T09:48:41.3163483Z %38 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:48:41.3163628Z %39 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:48:41.3163764Z %40 = arith.muli %38, %c128_i32 : i32 2026-02-21T09:48:41.3163952Z %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3164196Z %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3164385Z %43 = arith.muli %39, %c512_i32 : i32 2026-02-21T09:48:41.3164627Z %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3164901Z %45 = arith.muli %44, %cst_12 : tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3165115Z %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 
2026-02-21T09:48:41.3165325Z %47 = arith.extsi %43 : i32 to i64 2026-02-21T09:48:41.3165507Z %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3165740Z %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3166058Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3166358Z %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3166579Z %52 = arith.cmpi sge, %50, %cst_10 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3166765Z %53 = arith.cmpi slt, %50, %cst_11 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3166942Z %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked> 2026-02-21T09:48:41.3167135Z %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3167400Z %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:48:41.3167639Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3167812Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3168037Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3168309Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3168586Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3168781Z %243 = arith.addi %46, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3168981Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3169211Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3169433Z 
%246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3169787Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3170200Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3170483Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:41.3170657Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3170872Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3171148Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3171391Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3171583Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3171778Z %255 = arith.addi %254, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3171972Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3172183Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3172357Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3172517Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3172702Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3172888Z %261 = arith.andi %260, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3173102Z %262 = tt.load %256, %261, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3173362Z %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> 
tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3173647Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3173905Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3174148Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3174450Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3174801Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3175352Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3175606Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3175852Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3176097Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3176335Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3176559Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3176815Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3177155Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3177640Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 
1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3178009Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3178132Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:48:41.3178304Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3178529Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3178804Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3179083Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3179278Z %284 = arith.addi %46, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3179481Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3179689Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3179916Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3180247Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3180654Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3180938Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:48:41.3181128Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3181345Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3181620Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3181877Z %294 = 
arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3182069Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3182259Z %296 = arith.addi %295, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3182454Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3182663Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3182831Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3182996Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3183179Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3183370Z %302 = arith.andi %301, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3183537Z %303 = tt.load %297, %302, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3183797Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3184086Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3184327Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3184572Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3184886Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3185229Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3185538Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3185787Z %311 = arith.select %16, %310, %cst_16 : 
tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3186035Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3186279Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3186513Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3186744Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3186999Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3187335Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3187812Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3188156Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:41.3188282Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:48:41.3188455Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3188674Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3188968Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3189244Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3189439Z %325 = arith.addi %46, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3189663Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3189867Z %327 = tt.load 
%326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3190089Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3190417Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3190827Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3191112Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:48:41.3191280Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3191501Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3191770Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3192015Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3192206Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3192395Z %337 = arith.addi %336, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3192608Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3192813Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3192982Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3193159Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3193345Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3193532Z %343 = arith.andi %342, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3193697Z %344 = tt.load %338, %343, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 
2026-02-21T09:48:41.3193957Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3194247Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3194490Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3194736Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3195031Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3195377Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3195668Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3195915Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3196164Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3196407Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3196660Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3196889Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3197141Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3197487Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3197953Z %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3198304Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3198438Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:41.3198580Z %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3198778Z %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3198979Z %59 = tt.load %58 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3199197Z %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3199520Z %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3199920Z %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3200216Z %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3200424Z %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3200618Z %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3200779Z %66 = tt.load %64, %65, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3201050Z %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3201330Z %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3201562Z %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3201799Z %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:48:41.3202091Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3202426Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3202760Z %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3203003Z %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3203244Z %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3203477Z %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3203702Z %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3203921Z %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3204166Z %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3204511Z %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3204971Z %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3205366Z %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:48:41.3205538Z %83 = arith.extsi %40 : i32 to i64 2026-02-21T09:48:41.3205698Z %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3205905Z %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:48:41.3206166Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3206399Z %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3206576Z %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3206777Z %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3206980Z %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3207238Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3207493Z %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3207670Z %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3207856Z %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3208054Z %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3240815Z %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3241095Z %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma> 2026-02-21T09:48:41.3241267Z %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3241526Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3241691Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3241847Z %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma> 2026-02-21T09:48:41.3242019Z %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3242199Z %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3242357Z tt.store %94, %82, %103 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:48:41.3242507Z %104 = arith.addi %arg3, %c1_i32 : i32 
2026-02-21T09:48:41.3242674Z %105 = arith.remsi %104, %c128_i32 : i32 2026-02-21T09:48:41.3242797Z %106 = arith.divsi %104, %c128_i32 : i32 2026-02-21T09:48:41.3242918Z %107 = arith.muli %105, %c128_i32 : i32 2026-02-21T09:48:41.3243091Z %108 = tt.splat %107 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3243323Z %109 = arith.addi %108, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3243499Z %110 = arith.muli %106, %c512_i32 : i32 2026-02-21T09:48:41.3243728Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3243984Z %112 = arith.muli %111, %cst_12 : tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3244184Z %113 = tt.broadcast %112 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3244363Z %114 = arith.extsi %110 : i32 to i64 2026-02-21T09:48:41.3244530Z %115 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3244775Z %116 = arith.addi %115, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3245052Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3245334Z %118 = tt.broadcast %117 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3245559Z %119 = arith.cmpi sge, %117, %cst_10 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3245731Z %120 = arith.cmpi slt, %117, %cst_11 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3245899Z %121 = arith.andi %119, %120 : tensor<1x512xi1, #blocked> 2026-02-21T09:48:41.3246084Z %122 = tt.broadcast %121 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3246350Z %123 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, 
#mma>) : i32 { 2026-02-21T09:48:41.3246570Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3246743Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3246968Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3247244Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3247524Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3247720Z %243 = arith.addi %113, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3247924Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3248138Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3248362Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3248723Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3249152Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3249433Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:41.3249606Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3249822Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3250094Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3250337Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3250527Z %254 = tt.broadcast %253 : 
tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3250719Z %255 = arith.addi %254, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3250915Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3251122Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3251290Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3251454Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3251638Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3251825Z %261 = arith.andi %260, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3251995Z %262 = tt.load %256, %261, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3252254Z %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3252559Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3252805Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3253064Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3253362Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3253705Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3253997Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3254250Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3254497Z %271 = tt.broadcast %268 : 
tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3254739Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3254972Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3255199Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3255457Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3255785Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3256287Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3256645Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3256786Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:48:41.3256961Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3257185Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3257464Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3257746Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3257944Z %284 = arith.addi %113, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3258152Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3258361Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3258589Z %287 = ttg.local_alloc %286 : 
(tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3258927Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3259334Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3259622Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:48:41.3259794Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3260021Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3260316Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3260562Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3260759Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3260965Z %296 = arith.addi %295, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3261168Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3261379Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3261550Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3261719Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3261903Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3262100Z %302 = arith.andi %301, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3262269Z %303 = tt.load %297, %302, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3262537Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3262827Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3263073Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3263324Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3263624Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3263999Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3264296Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3264566Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3264818Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3265060Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3265302Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3265533Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3265788Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3266123Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3266603Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, 
kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3266955Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:41.3267085Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:48:41.3267258Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3267484Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3267762Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3268038Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3268254Z %325 = arith.addi %113, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3268458Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3268669Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3268912Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3269241Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3269653Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3269934Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:48:41.3270112Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3270338Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3270616Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3270866Z %335 = arith.muli %334, 
%cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3271062Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3271263Z %337 = arith.addi %336, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3271467Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3271678Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3271855Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3272036Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3272225Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3272432Z %343 = arith.andi %342, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3272606Z %344 = tt.load %338, %343, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3272874Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3273160Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3273407Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3273652Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3273956Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3274308Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3274603Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3274859Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, 
tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3275112Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3275356Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3275600Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3275830Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3276119Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3276455Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3276941Z %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3277295Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3277430Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:41.3277581Z %124 = arith.addi %113, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3277789Z %125 = tt.addptr %6, %124 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3277998Z %126 = tt.load %125 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3278227Z %127 = ttg.local_alloc %126 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3278555Z %128 = ttg.local_load %127 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3278964Z %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:48:41.3279268Z %130 = arith.addi %33, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3279466Z %131 = tt.addptr %7, %130 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3279670Z %132 = arith.andi %37, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3279838Z %133 = tt.load %131, %132, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3280116Z %134 = ttg.convert_layout %133 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3280420Z %135 = arith.shli %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3280662Z %136 = arith.shrsi %135, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3280908Z %137 = arith.shrsi %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3281203Z %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3281550Z %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3281845Z %140 = tt.broadcast %138 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3282095Z %141 = arith.select %16, %140, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3282344Z %142 = tt.broadcast %139 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3282643Z %143 = arith.select %18, %142, %141 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3282887Z %144 = tt.reshape %143 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3283117Z %145 = arith.sitofp %144 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 
2026-02-21T09:48:41.3283372Z %146 = ttg.local_alloc %145 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3283707Z %147 = ttg.local_load %146 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3284204Z %148 = tt.dot %129, %147, %123, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3284596Z %149 = arith.truncf %148 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:48:41.3284798Z %150 = arith.extsi %107 : i32 to i64 2026-02-21T09:48:41.3284965Z %151 = tt.splat %150 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3285186Z %152 = arith.addi %151, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3285462Z %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3285703Z %154 = arith.muli %153, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3285890Z %155 = tt.broadcast %154 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3286105Z %156 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3286321Z %157 = arith.addi %156, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3286591Z %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3286864Z %159 = tt.broadcast %158 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3287056Z %160 = arith.addi %155, %159 : tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3287248Z %161 = tt.addptr %19, %160 : tensor<128x512x!tt.ptr, #mma>, 
tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3287456Z %162 = arith.cmpi sge, %153, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3287622Z %163 = arith.cmpi slt, %153, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3287783Z %164 = arith.andi %162, %163 : tensor<128x1xi1, #mma> 2026-02-21T09:48:41.3287983Z %165 = tt.broadcast %164 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3288171Z %166 = arith.cmpi sge, %158, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3288361Z %167 = arith.cmpi slt, %158, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3288519Z %168 = arith.andi %166, %167 : tensor<1x512xi1, #mma> 2026-02-21T09:48:41.3288699Z %169 = tt.broadcast %168 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3288880Z %170 = arith.andi %165, %169 : tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3289051Z tt.store %161, %149, %170 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:48:41.3289208Z %171 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:48:41.3289334Z %172 = arith.remsi %171, %c128_i32 : i32 2026-02-21T09:48:41.3289464Z %173 = arith.divsi %171, %c128_i32 : i32 2026-02-21T09:48:41.3289587Z %174 = arith.muli %172, %c128_i32 : i32 2026-02-21T09:48:41.3289765Z %175 = tt.splat %174 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3289993Z %176 = arith.addi %175, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3290176Z %177 = arith.muli %173, %c512_i32 : i32 2026-02-21T09:48:41.3290406Z %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3290663Z %179 = arith.muli %178, %cst_12 : tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3290865Z %180 = tt.broadcast %179 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3291042Z %181 = arith.extsi %177 : i32 to i64 2026-02-21T09:48:41.3291213Z %182 = 
tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3291439Z %183 = arith.addi %182, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3291744Z %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3292028Z %185 = tt.broadcast %184 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3292234Z %186 = arith.cmpi sge, %184, %cst_10 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3292443Z %187 = arith.cmpi slt, %184, %cst_11 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3292613Z %188 = arith.andi %186, %187 : tensor<1x512xi1, #blocked> 2026-02-21T09:48:41.3292822Z %189 = tt.broadcast %188 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3293092Z %190 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:48:41.3293310Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3293483Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3293704Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3293980Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3294259Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3294454Z %243 = arith.addi %180, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3294658Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3294864Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3295087Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> 
!ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3295433Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3295837Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3296141Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:41.3296312Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3296527Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3296798Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3297041Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3297231Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3297423Z %255 = arith.addi %254, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3297618Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3297824Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3297996Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3298160Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3298343Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3298532Z %261 = arith.andi %260, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3298701Z %262 = tt.load %256, %261, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3298957Z %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:48:41.3299244Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3299500Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3299745Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3300040Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3300399Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3300689Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3300935Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3301184Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3301426Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3301660Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3301887Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3302139Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3302466Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3302942Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, 
#mma> 2026-02-21T09:48:41.3303302Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3303428Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:48:41.3303598Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3303836Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3304113Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3304388Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3304584Z %284 = arith.addi %180, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3304786Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3304992Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3305217Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3305542Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3305950Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3306232Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:48:41.3306399Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3306617Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3306885Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3307132Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 
2026-02-21T09:48:41.3307443Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3307635Z %296 = arith.addi %295, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3307832Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3308051Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3308222Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3308382Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3308571Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3308760Z %302 = arith.andi %301, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3308928Z %303 = tt.load %297, %302, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3309193Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3309481Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3309725Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3309970Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3310265Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3310611Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3310899Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3311168Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 
2026-02-21T09:48:41.3311415Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3311673Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3311912Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3312136Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3312392Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3312725Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3313198Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3313550Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:41.3313676Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:48:41.3313844Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3314068Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3314341Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3314616Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3314813Z %325 = arith.addi %180, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3315032Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3315239Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 
2026-02-21T09:48:41.3315460Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3315801Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3316206Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3316484Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:48:41.3316652Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3316868Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3317140Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3317387Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3317578Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3317771Z %337 = arith.addi %336, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3317963Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3318170Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3318341Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3318500Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3318699Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3318888Z %343 = arith.andi %342, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3319056Z %344 = tt.load %338, %343, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3319334Z %345 = ttg.convert_layout %344 : 
tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3319620Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3319860Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3320102Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3320400Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3320744Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3321031Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3321284Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3321529Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3321769Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3322005Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3322227Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3322485Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3322896Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3323374Z %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * 
tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3323741Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3323873Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:41.3324017Z %191 = arith.addi %180, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3324216Z %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3324425Z %193 = tt.load %192 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3324648Z %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3324974Z %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3325377Z %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3325675Z %197 = arith.addi %33, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3325871Z %198 = tt.addptr %7, %197 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3326067Z %199 = arith.andi %37, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3326232Z %200 = tt.load %198, %199, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3326491Z %201 = ttg.convert_layout %200 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3326791Z %202 = arith.shli %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3327033Z %203 = arith.shrsi %202, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3327293Z %204 = arith.shrsi %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3327588Z %205 = 
tt.expand_dims %203 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3327929Z %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3328215Z %207 = tt.broadcast %205 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3328464Z %208 = arith.select %16, %207, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3328711Z %209 = tt.broadcast %206 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3328949Z %210 = arith.select %18, %209, %208 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3329188Z %211 = tt.reshape %210 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3329414Z %212 = arith.sitofp %211 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3329670Z %213 = ttg.local_alloc %212 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3330000Z %214 = ttg.local_load %213 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3330468Z %215 = tt.dot %196, %214, %190, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3330873Z %216 = arith.truncf %215 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:48:41.3331052Z %217 = arith.extsi %174 : i32 to i64 2026-02-21T09:48:41.3331215Z %218 = tt.splat %217 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3331442Z %219 = arith.addi %218, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:48:41.3331709Z %220 = tt.expand_dims %219 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3331949Z %221 = arith.muli %220, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3332132Z %222 = tt.broadcast %221 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3332340Z %223 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3332550Z %224 = arith.addi %223, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3332815Z %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3333080Z %226 = tt.broadcast %225 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3333262Z %227 = arith.addi %222, %226 : tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3333454Z %228 = tt.addptr %19, %227 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3333658Z %229 = arith.cmpi sge, %220, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3333819Z %230 = arith.cmpi slt, %220, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3333976Z %231 = arith.andi %229, %230 : tensor<128x1xi1, #mma> 2026-02-21T09:48:41.3334151Z %232 = tt.broadcast %231 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3334354Z %233 = arith.cmpi sge, %225, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3334518Z %234 = arith.cmpi slt, %225, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3334673Z %235 = arith.andi %233, %234 : tensor<1x512xi1, #mma> 2026-02-21T09:48:41.3334861Z %236 = tt.broadcast %235 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3335040Z %237 = arith.andi %232, %236 : tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3335202Z tt.store %228, %216, %237 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:48:41.3335346Z } 
{tt.num_stages = 1 : i32} 2026-02-21T09:48:41.3335467Z scf.for %arg3 = %26 to %2 step %c1_i32 : i32 { 2026-02-21T09:48:41.3335603Z %38 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:48:41.3335727Z %39 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:48:41.3335849Z %40 = arith.muli %38, %c128_i32 : i32 2026-02-21T09:48:41.3336015Z %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3336238Z %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:48:41.3336409Z %43 = arith.muli %39, %c512_i32 : i32 2026-02-21T09:48:41.3336637Z %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3336890Z %45 = arith.muli %44, %cst_12 : tensor<128x1xi32, #blocked2> 2026-02-21T09:48:41.3337079Z %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3337254Z %47 = arith.extsi %43 : i32 to i64 2026-02-21T09:48:41.3337415Z %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3337629Z %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:48:41.3337897Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3338187Z %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3338384Z %52 = arith.cmpi sge, %50, %cst_10 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3338557Z %53 = arith.cmpi slt, %50, %cst_11 : tensor<1x512xi64, #blocked> 2026-02-21T09:48:41.3338733Z %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked> 2026-02-21T09:48:41.3338910Z %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3339176Z %56 = scf.for %arg4 = %c0_i32 to %c510_i32 
step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:48:41.3339391Z %104 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3339562Z %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3339785Z %106 = arith.addi %105, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3340061Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3340345Z %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3340541Z %109 = arith.addi %46, %108 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3340744Z %110 = tt.addptr %6, %109 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3340953Z %111 = tt.load %110 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3341175Z %112 = ttg.local_alloc %111 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3341507Z %113 = ttg.local_load %112 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3341930Z %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3350044Z %115 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:41.3350309Z %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3350540Z %117 = arith.addi %116, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3350828Z %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3351077Z %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked> 
2026-02-21T09:48:41.3351271Z %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3351467Z %121 = arith.addi %120, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3351668Z %122 = tt.addptr %7, %121 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3351884Z %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3352053Z %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3352222Z %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3352408Z %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3352596Z %127 = arith.andi %126, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3352766Z %128 = tt.load %122, %127, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3353027Z %129 = ttg.convert_layout %128 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3353314Z %130 = arith.shli %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3353558Z %131 = arith.shrsi %130, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3353827Z %132 = arith.shrsi %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3354130Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3354488Z %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3354781Z %135 = tt.broadcast %133 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3355031Z %136 = arith.select %16, %135, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 
2026-02-21T09:48:41.3355282Z %137 = tt.broadcast %134 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3355527Z %138 = arith.select %18, %137, %136 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3355762Z %139 = tt.reshape %138 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3355994Z %140 = arith.sitofp %139 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3356250Z %141 = ttg.local_alloc %140 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3356581Z %142 = ttg.local_load %141 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3357063Z %143 = tt.dot %114, %142, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3357413Z %144 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:41.3357540Z %145 = arith.muli %144, %c2_i32 : i32 2026-02-21T09:48:41.3357730Z %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3357952Z %147 = arith.addi %146, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3358247Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3358523Z %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3358721Z %150 = arith.addi %46, %149 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3358926Z %151 = tt.addptr %6, %150 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3359133Z %152 = tt.load %151 : tensor<128x4x!tt.ptr, #blocked2> 
2026-02-21T09:48:41.3359361Z %153 = ttg.local_alloc %152 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3359692Z %154 = ttg.local_load %153 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3360099Z %155 = arith.extf %154 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3360382Z %156 = arith.extsi %144 : i32 to i64 2026-02-21T09:48:41.3360549Z %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3360768Z %158 = arith.addi %157, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3361041Z %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3361289Z %160 = arith.muli %159, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3361482Z %161 = tt.broadcast %160 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3361690Z %162 = arith.addi %161, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3361888Z %163 = tt.addptr %7, %162 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3362094Z %164 = arith.cmpi sge, %159, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3362282Z %165 = arith.cmpi slt, %159, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3362446Z %166 = arith.andi %164, %165 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3362677Z %167 = tt.broadcast %166 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3362867Z %168 = arith.andi %167, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3363035Z %169 = tt.load %163, %168, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3363299Z %170 = ttg.convert_layout %169 : 
tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3363585Z %171 = arith.shli %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3363832Z %172 = arith.shrsi %171, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3364077Z %173 = arith.shrsi %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3364377Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3364726Z %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3365021Z %176 = tt.broadcast %174 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3365291Z %177 = arith.select %16, %176, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3365541Z %178 = tt.broadcast %175 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3365781Z %179 = arith.select %18, %178, %177 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3366034Z %180 = tt.reshape %179 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3366264Z %181 = arith.sitofp %180 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3366517Z %182 = ttg.local_alloc %181 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3366851Z %183 = ttg.local_load %182 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3367326Z %184 = tt.dot %155, %183, %143, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * 
tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3367675Z %185 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:41.3367805Z %186 = arith.muli %185, %c2_i32 : i32 2026-02-21T09:48:41.3367975Z %187 = tt.splat %186 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3368201Z %188 = arith.addi %187, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:48:41.3368475Z %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:48:41.3368754Z %190 = tt.broadcast %189 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3368952Z %191 = arith.addi %46, %190 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3369154Z %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3369388Z %193 = tt.load %192 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3369618Z %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3369961Z %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3370386Z %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3370667Z %197 = arith.extsi %185 : i32 to i64 2026-02-21T09:48:41.3370838Z %198 = tt.splat %197 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3371056Z %199 = arith.addi %198, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:41.3371332Z %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 
2026-02-21T09:48:41.3371576Z %201 = arith.muli %200, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3371766Z %202 = tt.broadcast %201 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3371957Z %203 = arith.addi %202, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3372152Z %204 = tt.addptr %7, %203 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3372361Z %205 = arith.cmpi sge, %200, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3372530Z %206 = arith.cmpi slt, %200, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:48:41.3372691Z %207 = arith.andi %205, %206 : tensor<2x1xi1, #blocked> 2026-02-21T09:48:41.3372876Z %208 = tt.broadcast %207 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3373061Z %209 = arith.andi %208, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3373247Z %210 = tt.load %204, %209, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3373506Z %211 = ttg.convert_layout %210 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3373815Z %212 = arith.shli %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3374058Z %213 = arith.shrsi %212, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3374302Z %214 = arith.shrsi %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3374600Z %215 = tt.expand_dims %213 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3374944Z %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3375235Z %217 = tt.broadcast %215 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3375485Z %218 = arith.select 
%16, %217, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3375734Z %219 = tt.broadcast %216 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3375977Z %220 = arith.select %18, %219, %218 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3376214Z %221 = tt.reshape %220 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3376439Z %222 = arith.sitofp %221 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3376695Z %223 = ttg.local_alloc %222 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3377042Z %224 = ttg.local_load %223 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3377513Z %225 = tt.dot %196, %224, %184, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3377892Z scf.yield %225 : tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3378027Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:41.3378170Z %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3378365Z %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:48:41.3378565Z %59 = tt.load %58 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:48:41.3378783Z %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:48:41.3379108Z %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3379509Z %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3379804Z %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3379994Z %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:48:41.3380187Z %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:48:41.3380347Z %66 = tt.load %64, %65, %cst_15 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:48:41.3380601Z %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3380878Z %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3381131Z %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3381369Z %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:41.3381675Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3382013Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:48:41.3382294Z %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3382537Z %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3382775Z %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3383008Z %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:48:41.3383235Z %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3> 2026-02-21T09:48:41.3383456Z %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked3> to 
tensor<4x512xf32, #blocked3> 2026-02-21T09:48:41.3383707Z %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:48:41.3384027Z %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:41.3384487Z %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:48:41.3384867Z %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:48:41.3385040Z %83 = arith.extsi %40 : i32 to i64 2026-02-21T09:48:41.3385217Z %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3385428Z %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:41.3385691Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3385943Z %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3386118Z %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3386321Z %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3386524Z %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:41.3386780Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3387040Z %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3387216Z %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma> 2026-02-21T09:48:41.3387409Z %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, 
#mma> 2026-02-21T09:48:41.3387608Z %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3387768Z %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:48:41.3387920Z %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma> 2026-02-21T09:48:41.3388087Z %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3388268Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3388432Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:48:41.3388589Z %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma> 2026-02-21T09:48:41.3388782Z %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3388960Z %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma> 2026-02-21T09:48:41.3389137Z tt.store %94, %82, %103 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:48:41.3389280Z } {tt.num_stages = 1 : i32} 2026-02-21T09:48:41.3389385Z tt.return 2026-02-21T09:48:41.3389463Z } 2026-02-21T09:48:41.3389539Z } 2026-02-21T09:48:41.3389583Z 2026-02-21T09:48:41.3389615Z {-# 2026-02-21T09:48:41.3389694Z external_resources: { 2026-02-21T09:48:41.3389792Z mlir_reproducer: { 2026-02-21T09:48:41.3390793Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:48:41.3391786Z disable_threading: false, 2026-02-21T09:48:41.3391893Z verify_each: true 2026-02-21T09:48:41.3391982Z } 
2026-02-21T09:48:41.3392055Z } 2026-02-21T09:48:41.3392123Z #-} 2026-02-21T09:48:41.3392399Z /tmp/torchinductor_root/rq/crqlyd4egjayq53eu2nago7umu7wvbg3q2ja7qxq3ncocq5xs6tb.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:48:41.3393113Z /tmp/torchinductor_root/rq/crqlyd4egjayq53eu2nago7umu7wvbg3q2ja7qxq3ncocq5xs6tb.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:48:41.3393666Z [252s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:48:41.3394464Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 512], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:48:41.3395186Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:48:41.3395353Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:48:42.3209861Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:48:42.3224933Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:48:42.3225534Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:48:42.3226077Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:48:42.3226571Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:48:42.3227018Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:42.3227463Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:42.3227784Z #smem = #ttg.shared_memory 2026-02-21T09:48:42.3228196Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:48:42.3229270Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:48:42.3230089Z %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3230376Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:48:42.3230589Z %c608_i32 = arith.constant 608 : i32 2026-02-21T09:48:42.3230793Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:48:42.3230994Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:48:42.3231200Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:48:42.3231466Z %cst_0 = arith.constant dense<0> : tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3231730Z %c1824_i32 = arith.constant 1824 : i32 2026-02-21T09:48:42.3231933Z %c1216_i32 = arith.constant 1216 : i32 2026-02-21T09:48:42.3232132Z %c6_i32 = arith.constant 6 : i32 
2026-02-21T09:48:42.3232331Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:48:42.3232532Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:48:42.3232733Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T09:48:42.3232948Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:48:42.3233145Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:48:42.3233339Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:48:42.3233542Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:48:42.3233883Z %cst_1 = arith.constant dense<0> : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3234343Z %cst_2 = arith.constant dense<8192> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3234795Z %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3235126Z %c16991_i32 = arith.constant 16991 : i32 2026-02-21T09:48:42.3235397Z %cst_4 = arith.constant dense<1024> : tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3235886Z %cst_5 = arith.constant dense<4> : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3236267Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:42.3236570Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:42.3236941Z %cst_8 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3237414Z %cst_9 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3237736Z %cst_10 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3238023Z %cst_11 = arith.constant dense<0> : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3238245Z %cst_12 = arith.constant dense<8192> : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3238468Z %cst_13 = arith.constant dense<8192> : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3238696Z %cst_14 = arith.constant 
dense<0> : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3238916Z %cst_15 = arith.constant dense<16384> : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3239116Z %0 = tt.get_program_id x : i32 2026-02-21T09:48:42.3239375Z %1 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3239751Z %2 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3240105Z %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3240434Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3240751Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3241171Z %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3241738Z %7 = arith.extsi %6 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3242331Z %8 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3242943Z %9 = arith.extsi %8 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3243506Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:48:42.3244050Z %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, 
#ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:48:42.3244582Z %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:42.3244913Z %13 = arith.cmpi eq, %12, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:42.3245176Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked> 2026-02-21T09:48:42.3245433Z %15 = arith.cmpi eq, %12, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:42.3245679Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked> 2026-02-21T09:48:42.3245960Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:48:42.3246321Z %18 = arith.extsi %2 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3246723Z %19 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3247151Z %20 = arith.extsi %19 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3247396Z %21 = arith.subi %c16991_i32, %0 : i32 2026-02-21T09:48:42.3247521Z %22 = arith.divui %21, %c608_i32 : i32 2026-02-21T09:48:42.3247663Z %23 = arith.remsi %22, %c4_i32 : i32 2026-02-21T09:48:42.3247784Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:48:42.3247908Z %25 = arith.muli %24, %c608_i32 : i32 2026-02-21T09:48:42.3248029Z %26 = arith.addi %0, %25 : i32 2026-02-21T09:48:42.3248164Z scf.for %arg3 = %0 to %26 step %c2432_i32 : i32 { 2026-02-21T09:48:42.3248314Z %27 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:48:42.3248440Z %28 = arith.muli %27, %c4_i32 : i32 2026-02-21T09:48:42.3248562Z %29 = arith.subi %c64_i32, %28 : i32 2026-02-21T09:48:42.3248683Z %30 = arith.minsi %29, %c4_i32 : i32 2026-02-21T09:48:42.3248809Z %31 = arith.remsi %arg3, %c1024_i32 : i32 
2026-02-21T09:48:42.3248934Z %32 = arith.remsi %31, %30 : i32 2026-02-21T09:48:42.3249053Z %33 = arith.addi %28, %32 : i32 2026-02-21T09:48:42.3249174Z %34 = arith.divsi %31, %30 : i32 2026-02-21T09:48:42.3249294Z %35 = arith.muli %33, %c256_i32 : i32 2026-02-21T09:48:42.3249477Z %36 = tt.splat %35 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3249715Z %37 = arith.addi %36, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3249901Z %38 = arith.muli %34, %c32_i32 : i32 2026-02-21T09:48:42.3250135Z %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3250405Z %40 = arith.muli %39, %cst_4 : tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3250611Z %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3250813Z %42 = arith.extsi %38 : i32 to i64 2026-02-21T09:48:42.3251031Z %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3251363Z %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3251773Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3252224Z %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3252548Z %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3252803Z %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3253050Z %49 = arith.andi %47, %48 : tensor<1x32xi1, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3253358Z %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3253721Z %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:48:42.3253948Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3254132Z %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3254372Z %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3254669Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3254966Z %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3255191Z %223 = arith.addi %41, %222 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3255410Z %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3255632Z %225 = tt.load %224 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3255890Z %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3256245Z %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3256684Z %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3256992Z %229 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:42.3257208Z %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3257501Z %231 = arith.addi 
%230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3257888Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3258243Z %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3258550Z %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3258854Z %235 = arith.addi %234, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3259175Z %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3259490Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3259754Z %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3259991Z %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3260294Z %240 = tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3260594Z %241 = arith.andi %240, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3260840Z %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3261088Z %243 = arith.shli %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3261325Z %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3261566Z %245 = arith.shrsi %242, 
%cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3261859Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3262225Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3262511Z %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3262750Z %249 = arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3262991Z %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3263260Z %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3263491Z %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3263716Z %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3263981Z %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3264304Z %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3264779Z %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3265131Z %257 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3265256Z %258 = arith.muli %257, %c2_i32 : i32 2026-02-21T09:48:42.3265426Z %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3265651Z %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim 
= 0, parent = #blocked1}>> 2026-02-21T09:48:42.3265927Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3266202Z %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3266397Z %263 = arith.addi %41, %262 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3266597Z %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3266808Z %265 = tt.load %264 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3267049Z %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3267380Z %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3267805Z %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3268085Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:42.3268296Z %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3268595Z %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3268987Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3269339Z %273 = arith.muli %272, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3269650Z %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:42.3269953Z %275 = arith.addi %274, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3270260Z %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3270571Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3270827Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3271084Z %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3271380Z %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3271679Z %281 = arith.andi %280, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3271941Z %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3272186Z %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3272421Z %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3272654Z %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3272941Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3273275Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3273555Z %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3273793Z %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, 
tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3274025Z %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3274254Z %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3274484Z %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3274703Z %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3274971Z %294 = ttg.local_alloc %293 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3275292Z %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3275777Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3276124Z %297 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:42.3276246Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:42.3276418Z %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3276639Z %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3276918Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3277196Z %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3277392Z %303 = arith.addi %41, %302 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3277594Z %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3277800Z %305 = tt.load %304 : tensor<256x4x!tt.ptr, #blocked1> 
2026-02-21T09:48:42.3278027Z %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3278357Z %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3278764Z %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3279065Z %309 = arith.extsi %297 : i32 to i64 2026-02-21T09:48:42.3279271Z %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3279570Z %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3279971Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3280323Z %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3280629Z %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3280935Z %315 = arith.addi %314, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3281243Z %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3281558Z %317 = arith.cmpi sge, %312, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3281802Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3282042Z %319 = arith.andi %317, %318 : 
tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3282344Z %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3282703Z %321 = arith.andi %320, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3282968Z %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3283218Z %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3283482Z %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3283725Z %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3284013Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3284353Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3284642Z %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3284880Z %329 = arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3285123Z %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3285357Z %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3285593Z %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3285821Z %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3286075Z %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, 
#smem> 2026-02-21T09:48:42.3286402Z %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3286890Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3287241Z %337 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:48:42.3287374Z %338 = arith.muli %337, %c2_i32 : i32 2026-02-21T09:48:42.3287550Z %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3287798Z %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3288076Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3288361Z %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3288563Z %343 = arith.addi %41, %342 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3288767Z %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3288982Z %345 = tt.load %344 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3289210Z %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3289546Z %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3289961Z %348 = arith.extf %347 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3290245Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:42.3290460Z %350 = tt.splat %349 : i64 
-> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3290761Z %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3291173Z %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3291556Z %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3291864Z %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3292170Z %355 = arith.addi %354, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3292482Z %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3292798Z %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3293050Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3293287Z %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3293594Z %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3293897Z %361 = arith.andi %360, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3294140Z %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3294393Z %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3294627Z %364 = 
arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3294867Z %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3308479Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3308833Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3309147Z %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3309388Z %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3309626Z %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3309863Z %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3310096Z %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3310326Z %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3310583Z %374 = ttg.local_alloc %373 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3310909Z %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3311383Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3311740Z scf.yield %376 : tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3311879Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:42.3312053Z %52 = arith.truncf %51 : 
tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:48:42.3312224Z %53 = arith.extsi %35 : i32 to i64 2026-02-21T09:48:42.3312409Z %54 = tt.splat %53 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3312620Z %55 = arith.addi %54, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3312908Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3313152Z %57 = arith.muli %56, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3313331Z %58 = tt.broadcast %57 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3313540Z %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3313744Z %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3314006Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3314262Z %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3314444Z %63 = arith.addi %58, %62 : tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3314633Z %64 = tt.addptr %17, %63 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3314831Z %65 = arith.cmpi sge, %56, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3315002Z %66 = arith.cmpi slt, %56, %cst_15 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3315157Z %67 = arith.andi %65, %66 : tensor<256x1xi1, #mma> 2026-02-21T09:48:42.3315329Z %68 = tt.broadcast %67 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3315515Z %69 = arith.cmpi sge, %61, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3315675Z %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3315833Z %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma> 
2026-02-21T09:48:42.3316001Z %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3316194Z %73 = arith.andi %68, %72 : tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3316347Z tt.store %64, %52, %73 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:48:42.3316500Z %74 = arith.addi %arg3, %c608_i32 : i32 2026-02-21T09:48:42.3316628Z %75 = arith.divsi %74, %c1024_i32 : i32 2026-02-21T09:48:42.3316762Z %76 = arith.muli %75, %c4_i32 : i32 2026-02-21T09:48:42.3316886Z %77 = arith.subi %c64_i32, %76 : i32 2026-02-21T09:48:42.3317004Z %78 = arith.minsi %77, %c4_i32 : i32 2026-02-21T09:48:42.3317126Z %79 = arith.remsi %74, %c1024_i32 : i32 2026-02-21T09:48:42.3317245Z %80 = arith.remsi %79, %78 : i32 2026-02-21T09:48:42.3317363Z %81 = arith.addi %76, %80 : i32 2026-02-21T09:48:42.3317476Z %82 = arith.divsi %79, %78 : i32 2026-02-21T09:48:42.3317597Z %83 = arith.muli %81, %c256_i32 : i32 2026-02-21T09:48:42.3317771Z %84 = tt.splat %83 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3317995Z %85 = arith.addi %84, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3318171Z %86 = arith.muli %82, %c32_i32 : i32 2026-02-21T09:48:42.3318397Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3318655Z %88 = arith.muli %87, %cst_4 : tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3318854Z %89 = tt.broadcast %88 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3319029Z %90 = arith.extsi %86 : i32 to i64 2026-02-21T09:48:42.3319239Z %91 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3319535Z %92 = arith.addi %91, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3319946Z %93 = tt.expand_dims %92 
{axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3320391Z %94 = tt.broadcast %93 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3320701Z %95 = arith.cmpi sge, %93, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3320949Z %96 = arith.cmpi slt, %93, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3321179Z %97 = arith.andi %95, %96 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3321480Z %98 = tt.broadcast %97 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3321826Z %99 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:48:42.3322041Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3322222Z %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3322452Z %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3322768Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3323051Z %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3323249Z %223 = arith.addi %89, %222 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3323460Z %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3323670Z %225 = tt.load %224 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3323922Z %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> 
!ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3324258Z %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3324669Z %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3324980Z %229 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:42.3325193Z %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3325496Z %231 = arith.addi %230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3325893Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3326252Z %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3326564Z %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3326874Z %235 = arith.addi %234, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3327183Z %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3327502Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3327752Z %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3328010Z %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3328317Z %240 = 
tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3328635Z %241 = arith.andi %240, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3328885Z %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3329132Z %243 = arith.shli %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3329371Z %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3329614Z %245 = arith.shrsi %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3329904Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3330244Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3330536Z %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3330776Z %249 = arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3331015Z %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3331246Z %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3331482Z %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3331709Z %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3331982Z %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3332312Z %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, 
#shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3332782Z %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3333149Z %257 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3333274Z %258 = arith.muli %257, %c2_i32 : i32 2026-02-21T09:48:42.3333446Z %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3333670Z %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3333948Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3334230Z %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3334428Z %263 = arith.addi %89, %262 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3334629Z %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3334838Z %265 = tt.load %264 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3335060Z %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3335392Z %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3335828Z %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3336109Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:42.3336318Z %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = 
#blocked}>}>> 2026-02-21T09:48:42.3336643Z %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3337032Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3337388Z %273 = arith.muli %272, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3337697Z %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3338003Z %275 = arith.addi %274, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3338311Z %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3338625Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3338874Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3339107Z %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3339407Z %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3339702Z %281 = arith.andi %280, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3339960Z %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3340207Z %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3340439Z %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:48:42.3340689Z %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3340974Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3341306Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3341589Z %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3341828Z %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3342066Z %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3342298Z %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3342524Z %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3342745Z %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3342993Z %294 = ttg.local_alloc %293 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3343314Z %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3343803Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3344144Z %297 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:42.3344286Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:42.3344459Z %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:48:42.3344681Z %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3344956Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3345230Z %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3345425Z %303 = arith.addi %89, %302 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3345627Z %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3345832Z %305 = tt.load %304 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3346056Z %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3346388Z %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3346799Z %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3347083Z %309 = arith.extsi %297 : i32 to i64 2026-02-21T09:48:42.3347289Z %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3347588Z %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3347997Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3348350Z %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3348668Z %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3348966Z %315 = arith.addi %314, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3349275Z %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3349591Z %317 = arith.cmpi sge, %312, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3349840Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3350074Z %319 = arith.andi %317, %318 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3350434Z %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3350733Z %321 = arith.andi %320, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3350977Z %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3351221Z %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3351458Z %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3351708Z %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3351995Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3352347Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3352630Z %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3352865Z %329 = 
arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3353100Z %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3353326Z %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3353553Z %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3353774Z %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3354025Z %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3354355Z %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3354820Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3355164Z %337 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:48:42.3355285Z %338 = arith.muli %337, %c2_i32 : i32 2026-02-21T09:48:42.3355458Z %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3355700Z %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3355975Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3356253Z %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3356462Z %343 = arith.addi %89, %342 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3356662Z %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3356869Z %345 
= tt.load %344 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3357093Z %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3357422Z %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3357827Z %348 = arith.extf %347 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3358111Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:42.3358320Z %350 = tt.splat %349 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3358616Z %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3359010Z %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3359367Z %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3359687Z %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3359987Z %355 = arith.addi %354, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3360311Z %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3360631Z %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3360880Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:42.3361113Z %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3361410Z %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3361704Z %361 = arith.andi %360, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3361947Z %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3362193Z %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3362426Z %364 = arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3362700Z %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3362985Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3363323Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3363628Z %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3363862Z %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3364102Z %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3364353Z %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3364584Z %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3364808Z %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3365058Z %374 = ttg.local_alloc %373 : 
(tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3365382Z %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3365854Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3366200Z scf.yield %376 : tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3366333Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:42.3366499Z %100 = arith.truncf %99 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:48:42.3366668Z %101 = arith.extsi %83 : i32 to i64 2026-02-21T09:48:42.3366831Z %102 = tt.splat %101 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3367045Z %103 = arith.addi %102, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3367312Z %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3367572Z %105 = arith.muli %104, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3367754Z %106 = tt.broadcast %105 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3367978Z %107 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3368184Z %108 = arith.addi %107, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3368446Z %109 = tt.expand_dims %108 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3368700Z %110 = tt.broadcast %109 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3368882Z %111 = arith.addi %106, %110 : tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3369070Z %112 = 
tt.addptr %17, %111 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3369270Z %113 = arith.cmpi sge, %104, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3369439Z %114 = arith.cmpi slt, %104, %cst_15 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3369598Z %115 = arith.andi %113, %114 : tensor<256x1xi1, #mma> 2026-02-21T09:48:42.3369773Z %116 = tt.broadcast %115 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3369956Z %117 = arith.cmpi sge, %109, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3370123Z %118 = arith.cmpi slt, %109, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3370278Z %119 = arith.andi %117, %118 : tensor<1x32xi1, #mma> 2026-02-21T09:48:42.3370449Z %120 = tt.broadcast %119 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3370625Z %121 = arith.andi %116, %120 : tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3370780Z tt.store %112, %100, %121 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:48:42.3370930Z %122 = arith.addi %arg3, %c1216_i32 : i32 2026-02-21T09:48:42.3371054Z %123 = arith.divsi %122, %c1024_i32 : i32 2026-02-21T09:48:42.3371177Z %124 = arith.muli %123, %c4_i32 : i32 2026-02-21T09:48:42.3371311Z %125 = arith.subi %c64_i32, %124 : i32 2026-02-21T09:48:42.3371462Z %126 = arith.minsi %125, %c4_i32 : i32 2026-02-21T09:48:42.3371585Z %127 = arith.remsi %122, %c1024_i32 : i32 2026-02-21T09:48:42.3371703Z %128 = arith.remsi %127, %126 : i32 2026-02-21T09:48:42.3371854Z %129 = arith.addi %124, %128 : i32 2026-02-21T09:48:42.3371967Z %130 = arith.divsi %127, %126 : i32 2026-02-21T09:48:42.3372083Z %131 = arith.muli %129, %c256_i32 : i32 2026-02-21T09:48:42.3372254Z %132 = tt.splat %131 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3372480Z %133 = arith.addi %132, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3372654Z %134 = arith.muli %130, %c32_i32 : i32 2026-02-21T09:48:42.3372879Z 
%135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3373133Z %136 = arith.muli %135, %cst_4 : tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3373329Z %137 = tt.broadcast %136 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3373508Z %138 = arith.extsi %134 : i32 to i64 2026-02-21T09:48:42.3373713Z %139 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3374012Z %140 = arith.addi %139, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3374401Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3374843Z %142 = tt.broadcast %141 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3375157Z %143 = arith.cmpi sge, %141, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3375419Z %144 = arith.cmpi slt, %141, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3375653Z %145 = arith.andi %143, %144 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3375955Z %146 = tt.broadcast %145 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3376290Z %147 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:48:42.3376505Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3376679Z %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:48:42.3376903Z %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3377183Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3377459Z %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3377660Z %223 = arith.addi %137, %222 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3377863Z %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3378069Z %225 = tt.load %224 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3378293Z %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3378626Z %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3379059Z %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3379345Z %229 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:42.3379555Z %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3379870Z %231 = arith.addi %230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3380260Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3380619Z %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3380937Z %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3381242Z %235 = arith.addi %234, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3381557Z %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3381877Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3382120Z %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3382359Z %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3382662Z %240 = tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3382977Z %241 = arith.andi %240, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3383220Z %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3383479Z %243 = arith.shli %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3383715Z %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3383948Z %245 = arith.shrsi %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3384235Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3384568Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3384848Z %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3385087Z %249 
= arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3385323Z %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3385552Z %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3385781Z %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3386000Z %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3386249Z %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3394790Z %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3395349Z %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3395703Z %257 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3395854Z %258 = arith.muli %257, %c2_i32 : i32 2026-02-21T09:48:42.3396025Z %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3396253Z %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3396533Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3396809Z %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3397007Z %263 = arith.addi %137, %262 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3397217Z %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3397427Z 
%265 = tt.load %264 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3397653Z %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3397986Z %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3398396Z %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3398676Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:42.3398886Z %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3399198Z %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3399585Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3399962Z %273 = arith.muli %272, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3400268Z %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3400573Z %275 = arith.addi %274, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3400888Z %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3401206Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3401453Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:42.3401691Z %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3401997Z %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3402300Z %281 = arith.andi %280, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3402546Z %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3402844Z %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3403084Z %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3403347Z %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3403636Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3403988Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3404270Z %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3404506Z %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3404738Z %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3404970Z %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3405196Z %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3405423Z %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3405674Z %294 = ttg.local_alloc %293 : 
(tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3405996Z %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3406465Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3406812Z %297 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:42.3406936Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:42.3407130Z %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3407354Z %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3407650Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3407926Z %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3408122Z %303 = arith.addi %137, %302 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3408328Z %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3408533Z %305 = tt.load %304 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3408759Z %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3409090Z %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3409496Z %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3409782Z %309 = arith.extsi %297 : 
i32 to i64 2026-02-21T09:48:42.3409988Z %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3410287Z %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3410678Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3411038Z %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3411367Z %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3411672Z %315 = arith.addi %314, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3411995Z %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3412310Z %317 = arith.cmpi sge, %312, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3412553Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3412790Z %319 = arith.andi %317, %318 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3413091Z %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3413386Z %321 = arith.andi %320, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3413630Z %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3413873Z %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3414106Z %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3414341Z %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3414626Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3414979Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3415260Z %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3415512Z %329 = arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3415749Z %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3415977Z %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3416206Z %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3416425Z %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3416676Z %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3417003Z %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3417468Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3417814Z %337 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:48:42.3417938Z %338 = arith.muli %337, 
%c2_i32 : i32 2026-02-21T09:48:42.3418110Z %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3418336Z %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3418610Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3418890Z %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3419108Z %343 = arith.addi %137, %342 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3419319Z %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3419530Z %345 = tt.load %344 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3419773Z %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3420105Z %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3420515Z %348 = arith.extf %347 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3420799Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:42.3421012Z %350 = tt.splat %349 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3421307Z %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3421696Z %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3422051Z %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3422356Z %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3422660Z %355 = arith.addi %354, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3422988Z %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3423303Z %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3423567Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3423803Z %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3424105Z %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3424403Z %361 = arith.andi %360, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3424648Z %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3424896Z %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3425127Z %364 = arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3425363Z %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3425650Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3425983Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 
2026-02-21T09:48:42.3426267Z %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3426503Z %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3426741Z %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3426987Z %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3427215Z %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3427439Z %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3427705Z %374 = ttg.local_alloc %373 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3428027Z %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3428492Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3428838Z scf.yield %376 : tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3428974Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:42.3429144Z %148 = arith.truncf %147 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:48:42.3429318Z %149 = arith.extsi %131 : i32 to i64 2026-02-21T09:48:42.3429487Z %150 = tt.splat %149 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3429702Z %151 = arith.addi %150, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3429970Z %152 = tt.expand_dims %151 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3430210Z %153 = 
arith.muli %152, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3430395Z %154 = tt.broadcast %153 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3430605Z %155 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3430834Z %156 = arith.addi %155, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3431095Z %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3431370Z %158 = tt.broadcast %157 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3431555Z %159 = arith.addi %154, %158 : tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3431744Z %160 = tt.addptr %17, %159 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3431947Z %161 = arith.cmpi sge, %152, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3432119Z %162 = arith.cmpi slt, %152, %cst_15 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3432276Z %163 = arith.andi %161, %162 : tensor<256x1xi1, #mma> 2026-02-21T09:48:42.3432452Z %164 = tt.broadcast %163 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3432638Z %165 = arith.cmpi sge, %157, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3432804Z %166 = arith.cmpi slt, %157, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3432965Z %167 = arith.andi %165, %166 : tensor<1x32xi1, #mma> 2026-02-21T09:48:42.3433133Z %168 = tt.broadcast %167 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3433311Z %169 = arith.andi %164, %168 : tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3433471Z tt.store %160, %148, %169 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:48:42.3433626Z %170 = arith.addi %arg3, %c1824_i32 : i32 2026-02-21T09:48:42.3433751Z %171 = arith.divsi %170, %c1024_i32 : i32 2026-02-21T09:48:42.3433874Z %172 = arith.muli %171, %c4_i32 : i32 2026-02-21T09:48:42.3433995Z %173 = 
arith.subi %c64_i32, %172 : i32 2026-02-21T09:48:42.3434113Z %174 = arith.minsi %173, %c4_i32 : i32 2026-02-21T09:48:42.3434234Z %175 = arith.remsi %170, %c1024_i32 : i32 2026-02-21T09:48:42.3434354Z %176 = arith.remsi %175, %174 : i32 2026-02-21T09:48:42.3434493Z %177 = arith.addi %172, %176 : i32 2026-02-21T09:48:42.3434609Z %178 = arith.divsi %175, %174 : i32 2026-02-21T09:48:42.3434730Z %179 = arith.muli %177, %c256_i32 : i32 2026-02-21T09:48:42.3434903Z %180 = tt.splat %179 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3435146Z %181 = arith.addi %180, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3435321Z %182 = arith.muli %178, %c32_i32 : i32 2026-02-21T09:48:42.3435545Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3435803Z %184 = arith.muli %183, %cst_4 : tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3435999Z %185 = tt.broadcast %184 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3436179Z %186 = arith.extsi %182 : i32 to i64 2026-02-21T09:48:42.3436388Z %187 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3436686Z %188 = arith.addi %187, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3437077Z %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3437502Z %190 = tt.broadcast %189 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3437814Z %191 = arith.cmpi sge, %189, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:48:42.3438058Z %192 = arith.cmpi slt, %189, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3438308Z %193 = arith.andi %191, %192 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3438608Z %194 = tt.broadcast %193 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3438966Z %195 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:48:42.3439181Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3439354Z %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3439575Z %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3439848Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3440121Z %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3440319Z %223 = arith.addi %185, %222 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3440521Z %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3440726Z %225 = tt.load %224 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3440948Z %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3441275Z %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3441678Z %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3441959Z %229 = 
arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:42.3442185Z %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3442482Z %231 = arith.addi %230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3442898Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3443265Z %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3443567Z %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3443868Z %235 = arith.addi %234, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3444176Z %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3444492Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3444734Z %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3444968Z %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3445264Z %240 = tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3445560Z %241 = arith.andi %240, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3445803Z %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3446063Z %243 = arith.shli %242, %cst_5 : 
tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3446296Z %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3446548Z %245 = arith.shrsi %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3446832Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3447164Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3447442Z %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3447681Z %249 = arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3447915Z %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3448142Z %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3448368Z %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3448589Z %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3448839Z %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3449159Z %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3449627Z %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3449974Z %257 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3450112Z %258 = 
arith.muli %257, %c2_i32 : i32 2026-02-21T09:48:42.3450281Z %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3450503Z %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3450800Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3451075Z %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3451271Z %263 = arith.addi %185, %262 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3451472Z %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3451676Z %265 = tt.load %264 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3451897Z %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3452224Z %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3452628Z %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3452909Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:42.3453116Z %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3453411Z %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3453814Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3454164Z %273 = arith.muli %272, %cst_8 : 
tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3454481Z %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3454783Z %275 = arith.addi %274, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3455087Z %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3455397Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3455639Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3455872Z %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3456168Z %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3456464Z %281 = arith.andi %280, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3456705Z %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3456947Z %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3457175Z %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3457408Z %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3457690Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3458037Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3458317Z %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3458566Z %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3458797Z %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3459024Z %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3459246Z %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3459465Z %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3459713Z %294 = ttg.local_alloc %293 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3460032Z %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3460496Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3460839Z %297 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:42.3460961Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:42.3461127Z %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3461346Z %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3461634Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3461909Z %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3462124Z %303 = arith.addi %185, %302 : 
tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3462322Z %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3462525Z %305 = tt.load %304 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3462746Z %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3463073Z %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3463477Z %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3463754Z %309 = arith.extsi %297 : i32 to i64 2026-02-21T09:48:42.3463960Z %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3464258Z %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3464641Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3464993Z %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3465294Z %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3465613Z %315 = arith.addi %314, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3465919Z %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3466228Z %317 = arith.cmpi sge, %312, %cst_9 : 
tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3466484Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3466717Z %319 = arith.andi %317, %318 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3467010Z %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3467304Z %321 = arith.andi %320, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3467543Z %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3467785Z %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3468015Z %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3468247Z %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3468530Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3468860Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3469138Z %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3469389Z %329 = arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3469620Z %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3469867Z %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3470090Z %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, 
#blocked2> 2026-02-21T09:48:42.3470307Z %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3470554Z %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3470869Z %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3471336Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3471675Z %337 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:48:42.3471798Z %338 = arith.muli %337, %c2_i32 : i32 2026-02-21T09:48:42.3471970Z %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3472191Z %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3472465Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3472740Z %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3472935Z %343 = arith.addi %185, %342 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3473136Z %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3473356Z %345 = tt.load %344 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3473577Z %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3473903Z %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3474320Z %348 = arith.extf %347 : 
tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3474597Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:42.3474803Z %350 = tt.splat %349 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3475097Z %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3475483Z %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3475832Z %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3476137Z %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3476437Z %355 = arith.addi %354, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3476740Z %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3477054Z %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3477313Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3477544Z %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3477858Z %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3478151Z %361 = arith.andi %360, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:42.3478391Z %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3478631Z %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3478861Z %364 = arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3479094Z %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3479379Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3479713Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3479992Z %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3480227Z %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3480461Z %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3480687Z %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3480912Z %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3481154Z %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3481400Z %374 = ttg.local_alloc %373 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3481717Z %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3482191Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * 
tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3482536Z scf.yield %376 : tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3482692Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:42.3482859Z %196 = arith.truncf %195 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:48:42.3483031Z %197 = arith.extsi %179 : i32 to i64 2026-02-21T09:48:42.3483193Z %198 = tt.splat %197 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3483402Z %199 = arith.addi %198, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3483669Z %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3483910Z %201 = arith.muli %200, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3484090Z %202 = tt.broadcast %201 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3484291Z %203 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3484495Z %204 = arith.addi %203, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3484749Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3485021Z %206 = tt.broadcast %205 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3485200Z %207 = arith.addi %202, %206 : tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3485407Z %208 = tt.addptr %17, %207 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3485608Z %209 = arith.cmpi sge, %200, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3485773Z %210 = arith.cmpi slt, %200, %cst_15 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3485929Z %211 = arith.andi %209, %210 : tensor<256x1xi1, #mma> 2026-02-21T09:48:42.3486100Z %212 = tt.broadcast %211 : 
tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3486281Z %213 = arith.cmpi sge, %205, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3486444Z %214 = arith.cmpi slt, %205, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3486596Z %215 = arith.andi %213, %214 : tensor<1x32xi1, #mma> 2026-02-21T09:48:42.3486764Z %216 = tt.broadcast %215 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3486937Z %217 = arith.andi %212, %216 : tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3487096Z tt.store %208, %196, %217 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:48:42.3487262Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:48:42.3487421Z scf.for %arg3 = %26 to %c16384_i32 step %c608_i32 : i32 { 2026-02-21T09:48:42.3487565Z %27 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:48:42.3487684Z %28 = arith.muli %27, %c4_i32 : i32 2026-02-21T09:48:42.3487800Z %29 = arith.subi %c64_i32, %28 : i32 2026-02-21T09:48:42.3487913Z %30 = arith.minsi %29, %c4_i32 : i32 2026-02-21T09:48:42.3488031Z %31 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T09:48:42.3488147Z %32 = arith.remsi %31, %30 : i32 2026-02-21T09:48:42.3488256Z %33 = arith.addi %28, %32 : i32 2026-02-21T09:48:42.3488364Z %34 = arith.divsi %31, %30 : i32 2026-02-21T09:48:42.3488495Z %35 = arith.muli %33, %c256_i32 : i32 2026-02-21T09:48:42.3488663Z %36 = tt.splat %35 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3488884Z %37 = arith.addi %36, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:42.3489054Z %38 = arith.muli %34, %c32_i32 : i32 2026-02-21T09:48:42.3489292Z %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3489540Z %40 = arith.muli %39, %cst_4 : tensor<256x1xi32, #blocked1> 2026-02-21T09:48:42.3489735Z %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> 
tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3489905Z %42 = arith.extsi %38 : i32 to i64 2026-02-21T09:48:42.3490107Z %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3490397Z %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3490777Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3491197Z %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3491498Z %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3491736Z %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3491961Z %49 = arith.andi %47, %48 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3492263Z %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3492595Z %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>) : i32 { 2026-02-21T09:48:42.3492822Z %74 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3492990Z %75 = tt.splat %74 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3493207Z %76 = arith.addi %75, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3493474Z %77 = tt.expand_dims %76 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3493745Z %78 = tt.broadcast %77 : tensor<1x4xi32, 
#blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3493937Z %79 = arith.addi %41, %78 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3494133Z %80 = tt.addptr %4, %79 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3494336Z %81 = tt.load %80 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3494551Z %82 = ttg.local_alloc %81 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3494876Z %83 = ttg.local_load %82 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3495281Z %84 = arith.extf %83 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3495562Z %85 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:42.3495769Z %86 = tt.splat %85 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3496060Z %87 = arith.addi %86, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3496459Z %88 = tt.expand_dims %87 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3496804Z %89 = arith.muli %88, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3497117Z %90 = tt.broadcast %89 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3497413Z %91 = arith.addi %90, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3497708Z %92 = tt.addptr %5, %91 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:42.3498014Z %93 = arith.cmpi sge, %88, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3498250Z %94 = arith.cmpi slt, %88, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3498473Z %95 = arith.andi %93, %94 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3498763Z %96 = tt.broadcast %95 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3499051Z %97 = arith.andi %96, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3499283Z %98 = tt.load %92, %97, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3499519Z %99 = arith.shli %98, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3499747Z %100 = arith.shrsi %99, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3499994Z %101 = arith.shrsi %98, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3500282Z %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3500642Z %103 = tt.expand_dims %101 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3500926Z %104 = tt.broadcast %102 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3501167Z %105 = arith.select %14, %104, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3501408Z %106 = tt.broadcast %103 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3501642Z %107 = arith.select %16, %106, %105 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3501878Z %108 = tt.reshape %107 : 
tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3502108Z %109 = arith.sitofp %108 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3502359Z %110 = ttg.local_alloc %109 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3502687Z %111 = ttg.local_load %110 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3503158Z %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3503506Z %113 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:48:42.3503634Z %114 = arith.muli %113, %c2_i32 : i32 2026-02-21T09:48:42.3503806Z %115 = tt.splat %114 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3504034Z %116 = arith.addi %115, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3504336Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3504617Z %118 = tt.broadcast %117 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3504831Z %119 = arith.addi %41, %118 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3505034Z %120 = tt.addptr %4, %119 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3505242Z %121 = tt.load %120 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3505469Z %122 = ttg.local_alloc %121 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3505801Z %123 = ttg.local_load %122 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:48:42.3506208Z %124 = arith.extf %123 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3506493Z %125 = arith.extsi %113 : i32 to i64 2026-02-21T09:48:42.3506705Z %126 = tt.splat %125 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3507003Z %127 = arith.addi %126, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3507387Z %128 = tt.expand_dims %127 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3507746Z %129 = arith.muli %128, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3508066Z %130 = tt.broadcast %129 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3508370Z %131 = arith.addi %130, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3508705Z %132 = tt.addptr %5, %131 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3509019Z %133 = arith.cmpi sge, %128, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3509270Z %134 = arith.cmpi slt, %128, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3509504Z %135 = arith.andi %133, %134 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3509805Z %136 = tt.broadcast %135 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3510107Z %137 = arith.andi %136, %50 : tensor<2x32xi1, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3510349Z %138 = tt.load %132, %137, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3510599Z %139 = arith.shli %138, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3510837Z %140 = arith.shrsi %139, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3511077Z %141 = arith.shrsi %138, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3511369Z %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3511702Z %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3511984Z %144 = tt.broadcast %142 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3512241Z %145 = arith.select %14, %144, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3512475Z %146 = tt.broadcast %143 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3512706Z %147 = arith.select %16, %146, %145 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3512951Z %148 = tt.reshape %147 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3513175Z %149 = arith.sitofp %148 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3513422Z %150 = ttg.local_alloc %149 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3513740Z %151 = ttg.local_load %150 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3514207Z %152 = tt.dot %124, %151, %112, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3514553Z %153 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:42.3514676Z %154 = arith.muli %153, %c2_i32 : i32 2026-02-21T09:48:42.3514847Z %155 = tt.splat %154 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3515067Z %156 = arith.addi %155, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3515340Z %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3515617Z %158 = tt.broadcast %157 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3515821Z %159 = arith.addi %41, %158 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3516047Z %160 = tt.addptr %4, %159 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3516252Z %161 = tt.load %160 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3516399Z %162 = ttg.local_alloc %161 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3516571Z %163 = ttg.local_load %162 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3516770Z %164 = arith.extf %163 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3516817Z %165 = arith.extsi %153 : i32 to i64 2026-02-21T09:48:42.3516946Z %166 = tt.splat %165 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3517073Z %167 = arith.addi %166, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3517292Z %168 = tt.expand_dims %167 {axis = 1 : 
i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3517392Z %169 = arith.muli %168, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3517559Z %170 = tt.broadcast %169 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3517653Z %171 = arith.addi %170, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3517824Z %172 = tt.addptr %5, %171 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3517942Z %173 = arith.cmpi sge, %168, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518047Z %174 = arith.cmpi slt, %168, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518142Z %175 = arith.andi %173, %174 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518321Z %176 = tt.broadcast %175 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518416Z %177 = arith.andi %176, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518524Z %178 = tt.load %172, %177, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518619Z %179 = arith.shli %178, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518718Z %180 = arith.shrsi %179, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518814Z %181 = arith.shrsi %178, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3518961Z %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3519109Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3519202Z %184 = tt.broadcast %182 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3519304Z %185 = arith.select %14, %184, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3519396Z %186 = tt.broadcast %183 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3519493Z %187 = arith.select %16, %186, %185 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3519595Z %188 = tt.reshape %187 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3519689Z %189 = arith.sitofp %188 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3519824Z %190 = ttg.local_alloc %189 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3519990Z %191 = ttg.local_load %190 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3520253Z %192 = tt.dot %164, %191, %152, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3520299Z %193 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:48:42.3520344Z %194 = arith.muli %193, %c2_i32 : i32 2026-02-21T09:48:42.3520438Z %195 = tt.splat %194 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3520529Z %196 = arith.addi %195, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:42.3520671Z %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
-> tensor<1x4xi32, #blocked1> 2026-02-21T09:48:42.3520769Z %198 = tt.broadcast %197 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3520832Z %199 = arith.addi %41, %198 : tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3520935Z %200 = tt.addptr %4, %199 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi32, #blocked1> 2026-02-21T09:48:42.3520999Z %201 = tt.load %200 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T09:48:42.3521124Z %202 = ttg.local_alloc %201 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem> 2026-02-21T09:48:42.3521305Z %203 = ttg.local_load %202 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3521502Z %204 = arith.extf %203 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3521565Z %205 = arith.extsi %193 : i32 to i64 2026-02-21T09:48:42.3521694Z %206 = tt.splat %205 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3521821Z %207 = arith.addi %206, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:42.3522039Z %208 = tt.expand_dims %207 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3522138Z %209 = arith.muli %208, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3522307Z %210 = tt.broadcast %209 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3522402Z %211 = arith.addi %210, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3522608Z %212 = tt.addptr %5, %211 : 
tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3522713Z %213 = arith.cmpi sge, %208, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3522815Z %214 = arith.cmpi slt, %208, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3522908Z %215 = arith.andi %213, %214 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3523092Z %216 = tt.broadcast %215 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3523185Z %217 = arith.andi %216, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3523311Z %218 = tt.load %212, %217, %cst_1 : tensor<2x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3523407Z %219 = arith.shli %218, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3523506Z %220 = arith.shrsi %219, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3523601Z %221 = arith.shrsi %218, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:42.3523750Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3523899Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked> 2026-02-21T09:48:42.3523993Z %224 = tt.broadcast %222 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3524094Z %225 = arith.select %14, %224, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3524190Z %226 = tt.broadcast %223 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3524288Z %227 = 
arith.select %16, %226, %225 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked> 2026-02-21T09:48:42.3524375Z %228 = tt.reshape %227 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2> 2026-02-21T09:48:42.3524464Z %229 = arith.sitofp %228 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2> 2026-02-21T09:48:42.3524579Z %230 = ttg.local_alloc %229 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem> 2026-02-21T09:48:42.3524760Z %231 = ttg.local_load %230 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:42.3525018Z %232 = tt.dot %204, %231, %192, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3525086Z scf.yield %232 : tensor<256x32xf32, #mma> 2026-02-21T09:48:42.3525132Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:42.3525219Z %52 = arith.truncf %51 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma> 2026-02-21T09:48:42.3525261Z %53 = arith.extsi %35 : i32 to i64 2026-02-21T09:48:42.3525346Z %54 = tt.splat %53 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3525430Z %55 = arith.addi %54, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:42.3525573Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3525633Z %57 = arith.muli %56, %cst_13 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3525718Z %58 = tt.broadcast %57 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3525798Z %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:42.3525882Z %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T09:48:42.3526014Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3526091Z %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3526147Z %63 = arith.addi %58, %62 : tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3526240Z %64 = tt.addptr %17, %63 : tensor<256x32x!tt.ptr, #mma>, tensor<256x32xi64, #mma> 2026-02-21T09:48:42.3526326Z %65 = arith.cmpi sge, %56, %cst_14 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3526389Z %66 = arith.cmpi slt, %56, %cst_15 : tensor<256x1xi64, #mma> 2026-02-21T09:48:42.3526461Z %67 = arith.andi %65, %66 : tensor<256x1xi1, #mma> 2026-02-21T09:48:42.3526540Z %68 = tt.broadcast %67 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3526605Z %69 = arith.cmpi sge, %61, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3526664Z %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:42.3526719Z %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma> 2026-02-21T09:48:42.3526796Z %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3526849Z %73 = arith.andi %68, %72 : tensor<256x32xi1, #mma> 2026-02-21T09:48:42.3526912Z tt.store %64, %52, %73 : tensor<256x32x!tt.ptr, #mma> 2026-02-21T09:48:42.3526977Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:48:42.3527013Z tt.return 2026-02-21T09:48:42.3527050Z } 2026-02-21T09:48:42.3527087Z } 2026-02-21T09:48:42.3527091Z 2026-02-21T09:48:42.3527122Z {-# 2026-02-21T09:48:42.3527165Z external_resources: { 2026-02-21T09:48:42.3527206Z mlir_reproducer: { 2026-02-21T09:48:42.3528143Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:48:42.3528192Z disable_threading: false, 2026-02-21T09:48:42.3528230Z verify_each: true 2026-02-21T09:48:42.3528262Z } 2026-02-21T09:48:42.3528311Z } 2026-02-21T09:48:42.3528347Z #-} 2026-02-21T09:48:42.3528583Z /tmp/torchinductor_root/vg/cvgiz5oieiqzkpc2xx5tqhdu2sgedy4mj66kjm2f2poq4hiiezf3.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:48:42.3528995Z /tmp/torchinductor_root/vg/cvgiz5oieiqzkpc2xx5tqhdu2sgedy4mj66kjm2f2poq4hiiezf3.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:48:42.3529128Z [253s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:48:42.3529761Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 256, 32], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[4, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:48:42.3529821Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:48:42.3529903Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:48:43.9219616Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:48:43.9234491Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:48:43.9235112Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:48:43.9235896Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:48:43.9236420Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:48:43.9236993Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:48:43.9237420Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:48:43.9237753Z #smem = #ttg.shared_memory 2026-02-21T09:48:43.9238177Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:48:43.9239035Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:48:43.9239740Z %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9240035Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:48:43.9240261Z %c608_i32 = arith.constant 608 : i32 2026-02-21T09:48:43.9240471Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:48:43.9240680Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:48:43.9240893Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:48:43.9241165Z %cst_0 = arith.constant dense<0> : 
tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9241433Z %c1824_i32 = arith.constant 1824 : i32 2026-02-21T09:48:43.9241646Z %c1216_i32 = arith.constant 1216 : i32 2026-02-21T09:48:43.9241857Z %c12_i32 = arith.constant 12 : i32 2026-02-21T09:48:43.9242059Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:48:43.9242263Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T09:48:43.9242467Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:48:43.9242798Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:48:43.9243011Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:48:43.9243223Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:48:43.9243543Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:48:43.9243878Z %cst_1 = arith.constant dense<0> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9244350Z %cst_2 = arith.constant dense<8192> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9244881Z %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9245226Z %c33375_i32 = arith.constant 33375 : i32 2026-02-21T09:48:43.9245498Z %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9245884Z %cst_5 = arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9246269Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:43.9246606Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:43.9246981Z %cst_8 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9247314Z %cst_9 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9247658Z %cst_10 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9247939Z %cst_11 = arith.constant dense<0> : tensor<1x32xi64, #mma> 
2026-02-21T09:48:43.9248172Z %cst_12 = arith.constant dense<8192> : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9248395Z %cst_13 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9248624Z %cst_14 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9248846Z %cst_15 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9249046Z %0 = tt.get_program_id x : i32 2026-02-21T09:48:43.9249308Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9249713Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9250076Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9250427Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9250745Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9251154Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9251721Z %7 = arith.extsi %6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9252295Z %8 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9252868Z %9 = arith.extsi %8 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9253447Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent 
= #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:48:43.9254006Z %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:48:43.9254537Z %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:43.9254884Z %13 = arith.cmpi eq, %12, %cst_6 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:43.9255173Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:48:43.9255434Z %15 = arith.cmpi eq, %12, %cst_7 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:48:43.9255688Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked> 2026-02-21T09:48:43.9255963Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:48:43.9256352Z %18 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9256754Z %19 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9257158Z %20 = arith.extsi %19 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9257406Z %21 = arith.subi %c33375_i32, %0 : i32 2026-02-21T09:48:43.9257535Z %22 = arith.divui %21, %c608_i32 : i32 2026-02-21T09:48:43.9257663Z %23 = arith.remsi %22, %c4_i32 : i32 2026-02-21T09:48:43.9257785Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:48:43.9257910Z %25 = arith.muli %24, %c608_i32 : i32 2026-02-21T09:48:43.9258035Z %26 = arith.addi %0, %25 : i32 2026-02-21T09:48:43.9258170Z scf.for %arg3 = %0 to %26 step %c2432_i32 : i32 { 2026-02-21T09:48:43.9258324Z %27 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:48:43.9258454Z %28 = arith.muli %27, %c4_i32 : i32 
2026-02-21T09:48:43.9258582Z %29 = arith.subi %c128_i32, %28 : i32 2026-02-21T09:48:43.9258703Z %30 = arith.minsi %29, %c4_i32 : i32 2026-02-21T09:48:43.9258835Z %31 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T09:48:43.9258966Z %32 = arith.remsi %31, %30 : i32 2026-02-21T09:48:43.9259087Z %33 = arith.addi %28, %32 : i32 2026-02-21T09:48:43.9259206Z %34 = arith.divsi %31, %30 : i32 2026-02-21T09:48:43.9259328Z %35 = arith.muli %33, %c128_i32 : i32 2026-02-21T09:48:43.9259527Z %36 = tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9259767Z %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9259971Z %38 = arith.muli %34, %c32_i32 : i32 2026-02-21T09:48:43.9260216Z %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9260483Z %40 = arith.muli %39, %cst_4 : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9260691Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9260878Z %42 = arith.extsi %38 : i32 to i64 2026-02-21T09:48:43.9261099Z %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9261413Z %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9261824Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9262285Z %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9262620Z %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9262878Z %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9263126Z %49 = arith.andi %47, %48 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9263442Z %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9263824Z %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>) : i32 { 2026-02-21T09:48:43.9264057Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:43.9264251Z %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9264507Z %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9264800Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9265099Z %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9265308Z %223 = arith.addi %41, %222 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9265532Z %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9265758Z %225 = tt.load %224 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9266001Z %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9266367Z %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9266809Z %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9267096Z %229 = arith.extsi %arg4 : i32 to 
i64 2026-02-21T09:48:43.9267311Z %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9267607Z %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9268018Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9268391Z %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9268696Z %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9269001Z %235 = arith.addi %234, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9269306Z %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9269625Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9269875Z %238 = arith.cmpi slt, %232, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9270111Z %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9270416Z %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9270712Z %241 = arith.andi %240, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9270958Z %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9271205Z %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 
1, parent = #blocked}>> 2026-02-21T09:48:43.9271437Z %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9271672Z %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9271990Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9272325Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9272622Z %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9272860Z %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9273098Z %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9273331Z %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9273559Z %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9273788Z %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9274037Z %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9274363Z %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9274838Z %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9275188Z %257 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:43.9275313Z %258 = arith.muli %257, %c2_i32 : i32 
2026-02-21T09:48:43.9275483Z %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9275708Z %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9276000Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9276293Z %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9276489Z %263 = arith.addi %41, %262 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9276689Z %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9276898Z %265 = tt.load %264 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9277124Z %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9277453Z %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9277862Z %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9278147Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:43.9278359Z %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9278656Z %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9279043Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9279398Z %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:48:43.9279705Z %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9280027Z %275 = arith.addi %274, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9280336Z %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9280667Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9280914Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9281152Z %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9281453Z %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9281754Z %281 = arith.andi %280, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9281999Z %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9282252Z %283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9282489Z %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9282773Z %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9283061Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9283391Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9283675Z %288 = 
tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9283932Z %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9284166Z %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9284424Z %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9284660Z %292 = tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9284883Z %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9285137Z %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9285463Z %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9285939Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9286286Z %297 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:48:43.9286410Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:43.9286583Z %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9286804Z %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9287081Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9287364Z %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9287558Z %303 = arith.addi %41, %302 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9287785Z %304 = 
tt.addptr %4, %303 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9287992Z %305 = tt.load %304 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9288219Z %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9288572Z %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9288974Z %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9289258Z %309 = arith.extsi %297 : i32 to i64 2026-02-21T09:48:43.9289503Z %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9289800Z %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9290186Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9290546Z %313 = arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9290856Z %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9291159Z %315 = arith.addi %314, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9291467Z %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9291806Z %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9292056Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9292308Z %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9292609Z %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9292904Z %321 = arith.andi %320, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9293147Z %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9293392Z %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9293634Z %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9301921Z %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9302405Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9302834Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9303122Z %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9303393Z %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9303780Z %330 = tt.broadcast %327 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9304100Z %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9304388Z %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9304662Z %333 = arith.sitofp %332 : 
tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9305084Z %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9305617Z %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9306129Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9306580Z %337 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:48:43.9306713Z %338 = arith.muli %337, %c2_i32 : i32 2026-02-21T09:48:43.9306892Z %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9307275Z %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9307748Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9308207Z %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9308524Z %343 = arith.addi %41, %342 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9308864Z %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9309207Z %345 = tt.load %344 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9309578Z %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9310168Z %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9310589Z %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9311019Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:43.9311364Z %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9311664Z %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9312179Z %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9312801Z %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9313337Z %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9313842Z %355 = arith.addi %354, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9314154Z %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9314469Z %357 = arith.cmpi sge, %352, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9314716Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9314953Z %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9315253Z %360 = tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9315579Z %361 = arith.andi %360, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9315824Z %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9316088Z %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9316323Z %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9316557Z %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9316846Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9317186Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9317470Z %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9317709Z %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9317943Z %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9318215Z %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9318446Z %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9318679Z %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9319089Z %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9319583Z %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9320370Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 
2026-02-21T09:48:43.9320741Z scf.yield %376 : tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9320920Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:43.9321255Z %52 = arith.truncf %51 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma> 2026-02-21T09:48:43.9321537Z %53 = arith.extsi %35 : i32 to i64 2026-02-21T09:48:43.9321801Z %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9322146Z %55 = arith.addi %54, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9322628Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9323027Z %57 = arith.muli %56, %cst_13 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9323315Z %58 = tt.broadcast %57 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9323644Z %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9323911Z %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9324167Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9324481Z %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9324768Z %63 = arith.addi %58, %62 : tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9325069Z %64 = tt.addptr %17, %63 : tensor<128x32x!tt.ptr, #mma>, tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9325392Z %65 = arith.cmpi sge, %56, %cst_14 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9325699Z %66 = arith.cmpi slt, %56, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9325956Z %67 = arith.andi %65, %66 : tensor<128x1xi1, #mma> 2026-02-21T09:48:43.9326236Z %68 = tt.broadcast %67 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9326567Z %69 = arith.cmpi sge, %61, 
%cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9326831Z %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9327078Z %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma> 2026-02-21T09:48:43.9327344Z %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9327626Z %73 = arith.andi %68, %72 : tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9327867Z tt.store %64, %52, %73 : tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:48:43.9328104Z %74 = arith.addi %arg3, %c608_i32 : i32 2026-02-21T09:48:43.9328261Z %75 = arith.divsi %74, %c1024_i32 : i32 2026-02-21T09:48:43.9328382Z %76 = arith.muli %75, %c4_i32 : i32 2026-02-21T09:48:43.9328503Z %77 = arith.subi %c128_i32, %76 : i32 2026-02-21T09:48:43.9328684Z %78 = arith.minsi %77, %c4_i32 : i32 2026-02-21T09:48:43.9328842Z %79 = arith.remsi %74, %c1024_i32 : i32 2026-02-21T09:48:43.9328960Z %80 = arith.remsi %79, %78 : i32 2026-02-21T09:48:43.9329075Z %81 = arith.addi %76, %80 : i32 2026-02-21T09:48:43.9329189Z %82 = arith.divsi %79, %78 : i32 2026-02-21T09:48:43.9329303Z %83 = arith.muli %81, %c128_i32 : i32 2026-02-21T09:48:43.9329472Z %84 = tt.splat %83 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9329693Z %85 = arith.addi %84, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9329866Z %86 = arith.muli %82, %c32_i32 : i32 2026-02-21T09:48:43.9330112Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9330362Z %88 = arith.muli %87, %cst_4 : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9330574Z %89 = tt.broadcast %88 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9330752Z %90 = arith.extsi %86 : i32 to i64 2026-02-21T09:48:43.9330959Z %91 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 
2026-02-21T09:48:43.9331250Z %92 = arith.addi %91, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9331679Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9332398Z %94 = tt.broadcast %93 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9332772Z %95 = arith.cmpi sge, %93, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9333013Z %96 = arith.cmpi slt, %93, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9333313Z %97 = arith.andi %95, %96 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9333800Z %98 = tt.broadcast %97 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9334358Z %99 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>) : i32 { 2026-02-21T09:48:43.9334714Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:43.9334953Z %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9335228Z %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9335524Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9335850Z %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9336185Z %223 = arith.addi %89, %222 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9336516Z %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr, #blocked1>, 
tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9336860Z %225 = tt.load %224 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9337128Z %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9337463Z %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9337873Z %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9338161Z %229 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:43.9338374Z %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9338671Z %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9339061Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9339420Z %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9339755Z %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9340060Z %235 = arith.addi %234, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9340383Z %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9340703Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9340954Z %238 = arith.cmpi slt, %232, %cst_10 : 
tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9341193Z %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9341498Z %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9341802Z %241 = arith.andi %240, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9342050Z %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9342297Z %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9342538Z %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9342774Z %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9343058Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9343392Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9343676Z %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9343929Z %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9344166Z %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9344410Z %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9344641Z %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9344864Z %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 
2026-02-21T09:48:43.9345114Z %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9345440Z %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9345919Z %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9346269Z %257 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:43.9346397Z %258 = arith.muli %257, %c2_i32 : i32 2026-02-21T09:48:43.9346568Z %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9346795Z %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9347073Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9347349Z %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9347549Z %263 = arith.addi %89, %262 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9347769Z %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9347980Z %265 = tt.load %264 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9348222Z %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9348553Z %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9348960Z %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 
2}>> 2026-02-21T09:48:43.9349242Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:43.9349453Z %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9349750Z %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9350135Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9350492Z %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9350796Z %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9351102Z %275 = arith.addi %274, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9351416Z %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9351754Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9352001Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9352244Z %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9352555Z %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9352851Z %281 = arith.andi %280, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9353092Z %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9353339Z 
%283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9353575Z %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9353810Z %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9354102Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9354440Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9354722Z %288 = tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9354958Z %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9355190Z %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9355424Z %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9355668Z %292 = tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9355890Z %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9356151Z %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9356476Z %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9356946Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9357286Z %297 = arith.addi %arg4, %c8_i32 : i32 
2026-02-21T09:48:43.9357408Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:43.9357577Z %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9357801Z %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9358077Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9358352Z %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9358547Z %303 = arith.addi %89, %302 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9358746Z %304 = tt.addptr %4, %303 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9358952Z %305 = tt.load %304 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9359174Z %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9359534Z %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9359941Z %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9360233Z %309 = arith.extsi %297 : i32 to i64 2026-02-21T09:48:43.9360442Z %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9360740Z %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9361124Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9361481Z %313 = 
arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9361786Z %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9362086Z %315 = arith.addi %314, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9362397Z %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9362774Z %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9363019Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9363253Z %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9363571Z %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9363865Z %321 = arith.andi %320, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9364124Z %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9364369Z %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9364600Z %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9364835Z %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9365120Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9365453Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9365735Z %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9365971Z %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9366203Z %330 = tt.broadcast %327 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9366432Z %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9366658Z %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9366879Z %333 = arith.sitofp %332 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9367128Z %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9367468Z %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9367933Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9368303Z %337 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:48:43.9368427Z %338 = arith.muli %337, %c2_i32 : i32 2026-02-21T09:48:43.9368599Z %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9368822Z %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9369097Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9369374Z %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9369568Z %343 = arith.addi 
%89, %342 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9369769Z %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9369974Z %345 = tt.load %344 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9370199Z %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9370528Z %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9370931Z %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9371210Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:43.9371431Z %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9371726Z %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9372123Z %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9372474Z %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9372777Z %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9373075Z %355 = arith.addi %354, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9373380Z %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9373695Z %357 = arith.cmpi sge, %352, %cst_9 : 
tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9373939Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9374175Z %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9374473Z %360 = tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9374766Z %361 = arith.andi %360, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9375008Z %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9375252Z %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9375503Z %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9375738Z %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9376035Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9376366Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9376644Z %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9376879Z %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9377114Z %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9377344Z %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9377571Z %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, 
#blocked2> 2026-02-21T09:48:43.9377791Z %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9378041Z %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9378363Z %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9378830Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9379192Z scf.yield %376 : tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9379358Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:43.9379560Z %100 = arith.truncf %99 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma> 2026-02-21T09:48:43.9379745Z %101 = arith.extsi %83 : i32 to i64 2026-02-21T09:48:43.9379910Z %102 = tt.splat %101 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9380123Z %103 = arith.addi %102, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9380389Z %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9380634Z %105 = arith.muli %104, %cst_13 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9380817Z %106 = tt.broadcast %105 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9381023Z %107 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9381231Z %108 = arith.addi %107, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9381494Z %109 = tt.expand_dims %108 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 
2026-02-21T09:48:43.9381755Z %110 = tt.broadcast %109 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9381938Z %111 = arith.addi %106, %110 : tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9382126Z %112 = tt.addptr %17, %111 : tensor<128x32x!tt.ptr, #mma>, tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9382326Z %113 = arith.cmpi sge, %104, %cst_14 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9382493Z %114 = arith.cmpi slt, %104, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9382652Z %115 = arith.andi %113, %114 : tensor<128x1xi1, #mma> 2026-02-21T09:48:43.9382826Z %116 = tt.broadcast %115 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9383029Z %117 = arith.cmpi sge, %109, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9383195Z %118 = arith.cmpi slt, %109, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9383350Z %119 = arith.andi %117, %118 : tensor<1x32xi1, #mma> 2026-02-21T09:48:43.9383520Z %120 = tt.broadcast %119 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9383715Z %121 = arith.andi %116, %120 : tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9383874Z tt.store %112, %100, %121 : tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:48:43.9384025Z %122 = arith.addi %arg3, %c1216_i32 : i32 2026-02-21T09:48:43.9384150Z %123 = arith.divsi %122, %c1024_i32 : i32 2026-02-21T09:48:43.9384272Z %124 = arith.muli %123, %c4_i32 : i32 2026-02-21T09:48:43.9384393Z %125 = arith.subi %c128_i32, %124 : i32 2026-02-21T09:48:43.9384512Z %126 = arith.minsi %125, %c4_i32 : i32 2026-02-21T09:48:43.9384630Z %127 = arith.remsi %122, %c1024_i32 : i32 2026-02-21T09:48:43.9384752Z %128 = arith.remsi %127, %126 : i32 2026-02-21T09:48:43.9384865Z %129 = arith.addi %124, %128 : i32 2026-02-21T09:48:43.9384978Z %130 = arith.divsi %127, %126 : i32 2026-02-21T09:48:43.9385096Z %131 = arith.muli %129, %c128_i32 : i32 2026-02-21T09:48:43.9385267Z %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent 
= #blocked1}>> 2026-02-21T09:48:43.9385493Z %133 = arith.addi %132, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9385667Z %134 = arith.muli %130, %c32_i32 : i32 2026-02-21T09:48:43.9385894Z %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9386150Z %136 = arith.muli %135, %cst_4 : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9386347Z %137 = tt.broadcast %136 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9386541Z %138 = arith.extsi %134 : i32 to i64 2026-02-21T09:48:43.9386746Z %139 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9387060Z %140 = arith.addi %139, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9387450Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9387876Z %142 = tt.broadcast %141 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9388185Z %143 = arith.cmpi sge, %141, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9388428Z %144 = arith.cmpi slt, %141, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9388664Z %145 = arith.andi %143, %144 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9388964Z %146 = tt.broadcast %145 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9389306Z %147 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> 
(tensor<128x32xf32, #mma>) : i32 { 2026-02-21T09:48:43.9389523Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:43.9389695Z %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9389916Z %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9390193Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9390487Z %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9390683Z %223 = arith.addi %137, %222 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9390888Z %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9391095Z %225 = tt.load %224 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9391338Z %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9391665Z %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9392072Z %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9392351Z %229 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:43.9392562Z %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9392857Z %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9393240Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9393590Z %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9393896Z %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9394199Z %235 = arith.addi %234, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9394531Z %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9394842Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9395101Z %238 = arith.cmpi slt, %232, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9395337Z %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9395633Z %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9395928Z %241 = arith.andi %240, %146 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9396169Z %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9396414Z %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9396647Z %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9396879Z %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9397166Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9397502Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9397781Z %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9398015Z %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9398248Z %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9398496Z %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9398724Z %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9398944Z %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9399209Z %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9399530Z %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9400000Z %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9400355Z %257 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:43.9400478Z %258 = arith.muli %257, %c2_i32 : i32 2026-02-21T09:48:43.9400647Z %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9400870Z %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9401148Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9401425Z %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 
2026-02-21T09:48:43.9401621Z %263 = arith.addi %137, %262 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9401824Z %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9402029Z %265 = tt.load %264 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9402267Z %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9402643Z %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9403068Z %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9403349Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:43.9403557Z %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9403851Z %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9404235Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9404584Z %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9404890Z %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9405194Z %275 = arith.addi %274, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9405499Z %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9405815Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9406058Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9406317Z %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9406617Z %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9406914Z %281 = arith.andi %280, %146 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9407174Z %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9407418Z %283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9407651Z %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9407886Z %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9408174Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9408509Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9408789Z %288 = tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9409026Z %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9409264Z %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9409493Z %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9409723Z %292 = 
tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9409943Z %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9410212Z %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9410538Z %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9411018Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9411367Z %297 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:48:43.9411494Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:43.9411666Z %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9411892Z %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9412172Z %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9412451Z %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9412649Z %303 = arith.addi %137, %302 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9412851Z %304 = tt.addptr %4, %303 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9413061Z %305 = tt.load %304 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9413285Z %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9413621Z %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:48:43.9414033Z %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9414333Z %309 = arith.extsi %297 : i32 to i64 2026-02-21T09:48:43.9414543Z %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9414842Z %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9415242Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9415599Z %313 = arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9415904Z %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9416212Z %315 = arith.addi %314, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9416521Z %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9416838Z %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9417088Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9417326Z %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9417632Z %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9417940Z %321 = arith.andi %320, %146 : tensor<4x32xi1, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9418204Z %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9418455Z %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9418707Z %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9418951Z %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9419248Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9419588Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9419877Z %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9420117Z %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9420360Z %330 = tt.broadcast %327 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9420600Z %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9420831Z %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9421060Z %333 = arith.sitofp %332 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9421315Z %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9421644Z %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9422135Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9422484Z %337 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:48:43.9422618Z %338 = arith.muli %337, %c2_i32 : i32 2026-02-21T09:48:43.9422798Z %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9423039Z %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9423322Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9423600Z %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9423803Z %343 = arith.addi %137, %342 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9424008Z %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9424223Z %345 = tt.load %344 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9424453Z %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9424786Z %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9425203Z %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9425494Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:43.9425706Z %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9426010Z %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9426460Z %352 = tt.expand_dims %351 {axis = 1 
: i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9426842Z %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9427158Z %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9427466Z %355 = arith.addi %354, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9427783Z %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9428106Z %357 = arith.cmpi sge, %352, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9428353Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9428596Z %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9428901Z %360 = tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9429207Z %361 = arith.andi %360, %146 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9429459Z %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9429708Z %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9429948Z %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9430187Z %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9430500Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9430842Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9431139Z %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9431380Z %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9431617Z %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9431854Z %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9432088Z %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9432313Z %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9432570Z %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9432895Z %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9433366Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9433721Z scf.yield %376 : tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9433892Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:43.9434104Z %148 = arith.truncf %147 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma> 2026-02-21T09:48:43.9434301Z %149 = arith.extsi %131 : i32 to i64 2026-02-21T09:48:43.9434471Z %150 = tt.splat %149 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9434706Z %151 = 
arith.addi %150, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9434978Z %152 = tt.expand_dims %151 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9435230Z %153 = arith.muli %152, %cst_13 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9435415Z %154 = tt.broadcast %153 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9435632Z %155 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9435852Z %156 = arith.addi %155, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9436116Z %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9436381Z %158 = tt.broadcast %157 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9436566Z %159 = arith.addi %154, %158 : tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9436765Z %160 = tt.addptr %17, %159 : tensor<128x32x!tt.ptr, #mma>, tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9436977Z %161 = arith.cmpi sge, %152, %cst_14 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9437146Z %162 = arith.cmpi slt, %152, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9437312Z %163 = arith.andi %161, %162 : tensor<128x1xi1, #mma> 2026-02-21T09:48:43.9437492Z %164 = tt.broadcast %163 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9437684Z %165 = arith.cmpi sge, %157, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9437853Z %166 = arith.cmpi slt, %157, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9438020Z %167 = arith.andi %165, %166 : tensor<1x32xi1, #mma> 2026-02-21T09:48:43.9438217Z %168 = tt.broadcast %167 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9438398Z %169 = arith.andi %164, %168 : tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9438566Z tt.store %160, %148, %169 : 
tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:48:43.9438736Z %170 = arith.addi %arg3, %c1824_i32 : i32 2026-02-21T09:48:43.9438873Z %171 = arith.divsi %170, %c1024_i32 : i32 2026-02-21T09:48:43.9439000Z %172 = arith.muli %171, %c4_i32 : i32 2026-02-21T09:48:43.9439126Z %173 = arith.subi %c128_i32, %172 : i32 2026-02-21T09:48:43.9439251Z %174 = arith.minsi %173, %c4_i32 : i32 2026-02-21T09:48:43.9439375Z %175 = arith.remsi %170, %c1024_i32 : i32 2026-02-21T09:48:43.9439500Z %176 = arith.remsi %175, %174 : i32 2026-02-21T09:48:43.9439619Z %177 = arith.addi %172, %176 : i32 2026-02-21T09:48:43.9439741Z %178 = arith.divsi %175, %174 : i32 2026-02-21T09:48:43.9439863Z %179 = arith.muli %177, %c128_i32 : i32 2026-02-21T09:48:43.9440043Z %180 = tt.splat %179 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9440279Z %181 = arith.addi %180, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9440458Z %182 = arith.muli %178, %c32_i32 : i32 2026-02-21T09:48:43.9440691Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9440948Z %184 = arith.muli %183, %cst_4 : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9441151Z %185 = tt.broadcast %184 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9441335Z %186 = arith.extsi %182 : i32 to i64 2026-02-21T09:48:43.9441547Z %187 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9441870Z %188 = arith.addi %187, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9442261Z %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9442990Z %190 = tt.broadcast %189 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9443399Z %191 = arith.cmpi sge, %189, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9443648Z %192 = arith.cmpi slt, %189, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9443892Z %193 = arith.andi %191, %192 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9444199Z %194 = tt.broadcast %193 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9444553Z %195 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>) : i32 { 2026-02-21T09:48:43.9444780Z %218 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:43.9444958Z %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9445196Z %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9445484Z %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9445771Z %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9445977Z %223 = arith.addi %185, %222 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9446185Z %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9446475Z %225 = tt.load %224 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9446705Z %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9447044Z %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9447483Z %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9447771Z %229 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:43.9447986Z %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9448284Z %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9448676Z %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9449037Z %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9449348Z %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9449661Z %235 = arith.addi %234, %190 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9449976Z %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9450292Z %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9450564Z %238 = arith.cmpi slt, %232, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9450801Z %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9451126Z %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9451430Z %241 = arith.andi %240, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9451676Z %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9451927Z %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9452166Z %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9452408Z %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9452701Z %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9453038Z %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9453324Z %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9453565Z %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9453803Z %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9454039Z %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9454269Z %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9454514Z %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9454769Z %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9455092Z %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9455583Z %256 = tt.dot %228, %255, 
%arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9455934Z %257 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:43.9456064Z %258 = arith.muli %257, %c2_i32 : i32 2026-02-21T09:48:43.9456241Z %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9456470Z %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9456753Z %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9457036Z %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9457236Z %263 = arith.addi %185, %262 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9457440Z %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9457647Z %265 = tt.load %264 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9457872Z %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9458215Z %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9458623Z %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9458924Z %269 = arith.extsi %257 : i32 to i64 2026-02-21T09:48:43.9459134Z %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9459433Z %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = 
#blocked}>}>> 2026-02-21T09:48:43.9459818Z %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9460170Z %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9460477Z %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9460779Z %275 = arith.addi %274, %190 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9461091Z %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9461410Z %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9461652Z %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9461889Z %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9462187Z %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9462501Z %281 = arith.andi %280, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9462746Z %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9462992Z %283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9463241Z %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9463475Z %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9463760Z %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9464093Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9464374Z %288 = tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9464611Z %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9464847Z %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9465077Z %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9465307Z %292 = tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9465527Z %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9465776Z %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9466114Z %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9466579Z %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9466939Z %297 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:48:43.9467063Z %298 = arith.muli %297, %c2_i32 : i32 2026-02-21T09:48:43.9467235Z %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9467460Z %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9467738Z %301 = 
tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9468017Z %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9468216Z %303 = arith.addi %185, %302 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9468420Z %304 = tt.addptr %4, %303 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9468631Z %305 = tt.load %304 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9468857Z %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9469190Z %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9469593Z %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9469877Z %309 = arith.extsi %297 : i32 to i64 2026-02-21T09:48:43.9470088Z %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9470396Z %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9470781Z %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9471155Z %313 = arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9471457Z %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9471760Z %315 = arith.addi %314, %190 : tensor<4x32xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9472068Z %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9472389Z %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9472635Z %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9472872Z %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9473176Z %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9473476Z %321 = arith.andi %320, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9473723Z %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9473973Z %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9474229Z %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9474467Z %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9474766Z %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9475100Z %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9475382Z %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9475616Z %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9475850Z %330 = tt.broadcast %327 : 
tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9476079Z %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9476307Z %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9476530Z %333 = arith.sitofp %332 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9476777Z %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9477100Z %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9477567Z %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9477911Z %337 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:48:43.9478038Z %338 = arith.muli %337, %c2_i32 : i32 2026-02-21T09:48:43.9478224Z %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9478447Z %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9478724Z %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9479011Z %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9479208Z %343 = arith.addi %185, %342 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9479409Z %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9479617Z %345 = tt.load %344 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9479841Z %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> 
!ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9480169Z %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9480577Z %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9480858Z %349 = arith.extsi %337 : i32 to i64 2026-02-21T09:48:43.9481067Z %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9481366Z %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9481752Z %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9482119Z %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9482427Z %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9482824Z %355 = arith.addi %354, %190 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9483134Z %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9483447Z %357 = arith.cmpi sge, %352, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9483693Z %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9483929Z %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9484227Z %360 = 
tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9484524Z %361 = arith.andi %360, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9484766Z %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9485013Z %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9485246Z %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9485480Z %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9485766Z %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9486118Z %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9486400Z %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9486637Z %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9486888Z %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9487118Z %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9487344Z %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9487566Z %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9487814Z %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9488136Z %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, 
#shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9488609Z %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9488960Z scf.yield %376 : tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9489124Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:43.9489332Z %196 = arith.truncf %195 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma> 2026-02-21T09:48:43.9489503Z %197 = arith.extsi %179 : i32 to i64 2026-02-21T09:48:43.9489669Z %198 = tt.splat %197 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9489897Z %199 = arith.addi %198, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9490167Z %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9490428Z %201 = arith.muli %200, %cst_13 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9490610Z %202 = tt.broadcast %201 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9490818Z %203 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9491024Z %204 = arith.addi %203, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9491284Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9491548Z %206 = tt.broadcast %205 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9491732Z %207 = arith.addi %202, %206 : tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9491927Z %208 = tt.addptr %17, %207 : tensor<128x32x!tt.ptr, #mma>, tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9492128Z %209 = arith.cmpi 
sge, %200, %cst_14 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9492298Z %210 = arith.cmpi slt, %200, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9492460Z %211 = arith.andi %209, %210 : tensor<128x1xi1, #mma> 2026-02-21T09:48:43.9492635Z %212 = tt.broadcast %211 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9492821Z %213 = arith.cmpi sge, %205, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9492985Z %214 = arith.cmpi slt, %205, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9493142Z %215 = arith.andi %213, %214 : tensor<1x32xi1, #mma> 2026-02-21T09:48:43.9493313Z %216 = tt.broadcast %215 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9493492Z %217 = arith.andi %212, %216 : tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9493655Z tt.store %208, %196, %217 : tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:48:43.9493839Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:48:43.9494005Z scf.for %arg3 = %26 to %c32768_i32 step %c608_i32 : i32 { 2026-02-21T09:48:43.9494151Z %27 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:48:43.9494276Z %28 = arith.muli %27, %c4_i32 : i32 2026-02-21T09:48:43.9494408Z %29 = arith.subi %c128_i32, %28 : i32 2026-02-21T09:48:43.9494528Z %30 = arith.minsi %29, %c4_i32 : i32 2026-02-21T09:48:43.9494650Z %31 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T09:48:43.9494769Z %32 = arith.remsi %31, %30 : i32 2026-02-21T09:48:43.9494883Z %33 = arith.addi %28, %32 : i32 2026-02-21T09:48:43.9494992Z %34 = arith.divsi %31, %30 : i32 2026-02-21T09:48:43.9495105Z %35 = arith.muli %33, %c128_i32 : i32 2026-02-21T09:48:43.9495272Z %36 = tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9495495Z %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:48:43.9495668Z %38 = arith.muli %34, %c32_i32 : i32 2026-02-21T09:48:43.9495890Z %39 = tt.expand_dims %37 {axis = 1 : i32} 
: tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9496144Z %40 = arith.muli %39, %cst_4 : tensor<128x1xi32, #blocked1> 2026-02-21T09:48:43.9496336Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9496511Z %42 = arith.extsi %38 : i32 to i64 2026-02-21T09:48:43.9496713Z %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9497007Z %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9497405Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9497821Z %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9498142Z %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9498382Z %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9498608Z %49 = arith.andi %47, %48 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9498900Z %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9499232Z %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>) : i32 { 2026-02-21T09:48:43.9499451Z %74 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:48:43.9499620Z %75 = tt.splat %74 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9499835Z %76 = arith.addi %75, %3 : tensor<8xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9500107Z %77 = tt.expand_dims %76 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9500379Z %78 = tt.broadcast %77 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9500574Z %79 = arith.addi %41, %78 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9500773Z %80 = tt.addptr %4, %79 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9500974Z %81 = tt.load %80 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9501196Z %82 = ttg.local_alloc %81 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9501535Z %83 = ttg.local_load %82 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9501939Z %84 = arith.extf %83 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9502240Z %85 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:48:43.9502445Z %86 = tt.splat %85 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9502737Z %87 = arith.addi %86, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9503115Z %88 = tt.expand_dims %87 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9503463Z %89 = arith.muli %88, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9503765Z %90 = tt.broadcast %89 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:48:43.9504061Z %91 = arith.addi %90, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9504361Z %92 = tt.addptr %5, %91 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9504671Z %93 = arith.cmpi sge, %88, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9504909Z %94 = arith.cmpi slt, %88, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9505135Z %95 = arith.andi %93, %94 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9505441Z %96 = tt.broadcast %95 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9505747Z %97 = arith.andi %96, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9505981Z %98 = tt.load %92, %97, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9506217Z %99 = arith.shli %98, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9506446Z %100 = arith.shrsi %99, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9506677Z %101 = arith.shrsi %98, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9506963Z %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9507298Z %103 = tt.expand_dims %101 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9507576Z %104 = tt.broadcast %102 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9507814Z %105 = arith.select %14, %104, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 
2026-02-21T09:48:43.9508046Z %106 = tt.broadcast %103 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9508277Z %107 = arith.select %16, %106, %105 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9508503Z %108 = tt.reshape %107 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9508722Z %109 = arith.sitofp %108 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9508971Z %110 = ttg.local_alloc %109 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9509304Z %111 = ttg.local_load %110 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9509773Z %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9510131Z %113 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:48:43.9510255Z %114 = arith.muli %113, %c2_i32 : i32 2026-02-21T09:48:43.9510425Z %115 = tt.splat %114 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9510650Z %116 = arith.addi %115, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9515604Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9515893Z %118 = tt.broadcast %117 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9516092Z %119 = arith.addi %41, %118 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9516297Z %120 = tt.addptr %4, %119 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9516509Z %121 = tt.load %120 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9516736Z %122 = 
ttg.local_alloc %121 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9517071Z %123 = ttg.local_load %122 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9517485Z %124 = arith.extf %123 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9517809Z %125 = arith.extsi %113 : i32 to i64 2026-02-21T09:48:43.9518019Z %126 = tt.splat %125 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9518330Z %127 = arith.addi %126, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9518716Z %128 = tt.expand_dims %127 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9519071Z %129 = arith.muli %128, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9519375Z %130 = tt.broadcast %129 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9519676Z %131 = arith.addi %130, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9519985Z %132 = tt.addptr %5, %131 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9520299Z %133 = arith.cmpi sge, %128, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9520546Z %134 = arith.cmpi slt, %128, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9520780Z %135 = arith.andi %133, %134 : tensor<4x1xi1, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:48:43.9521080Z %136 = tt.broadcast %135 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9521377Z %137 = arith.andi %136, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9521619Z %138 = tt.load %132, %137, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9521883Z %139 = arith.shli %138, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9522115Z %140 = arith.shrsi %139, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9522350Z %141 = arith.shrsi %138, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9522700Z %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9523034Z %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9523316Z %144 = tt.broadcast %142 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9523552Z %145 = arith.select %14, %144, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9523786Z %146 = tt.broadcast %143 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9524017Z %147 = arith.select %16, %146, %145 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9524243Z %148 = tt.reshape %147 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9524464Z %149 = arith.sitofp %148 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9524715Z %150 = ttg.local_alloc %149 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9525036Z 
%151 = ttg.local_load %150 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9525527Z %152 = tt.dot %124, %151, %112, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9525873Z %153 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:48:43.9526000Z %154 = arith.muli %153, %c2_i32 : i32 2026-02-21T09:48:43.9526188Z %155 = tt.splat %154 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9526407Z %156 = arith.addi %155, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9526681Z %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9526956Z %158 = tt.broadcast %157 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9527151Z %159 = arith.addi %41, %158 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9527353Z %160 = tt.addptr %4, %159 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9527558Z %161 = tt.load %160 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9527684Z %162 = ttg.local_alloc %161 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9527855Z %163 = ttg.local_load %162 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9528055Z %164 = arith.extf %163 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9528100Z %165 = arith.extsi %153 : i32 to i64 2026-02-21T09:48:43.9528228Z %166 = tt.splat %165 : i64 -> tensor<4xi64, #ttg.slice<{dim = 
1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9528354Z %167 = arith.addi %166, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9528591Z %168 = tt.expand_dims %167 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9528688Z %169 = arith.muli %168, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9528857Z %170 = tt.broadcast %169 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9528967Z %171 = arith.addi %170, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529142Z %172 = tt.addptr %5, %171 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529245Z %173 = arith.cmpi sge, %168, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529347Z %174 = arith.cmpi slt, %168, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529442Z %175 = arith.andi %173, %174 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529605Z %176 = tt.broadcast %175 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529699Z %177 = arith.andi %176, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529808Z %178 = tt.load %172, %177, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529902Z %179 = arith.shli %178, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9529999Z %180 = arith.shrsi %179, %cst_5 : 
tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9530094Z %181 = arith.shrsi %178, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9530255Z %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9530403Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9530509Z %184 = tt.broadcast %182 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9530611Z %185 = arith.select %14, %184, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9530703Z %186 = tt.broadcast %183 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9530799Z %187 = arith.select %16, %186, %185 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9530887Z %188 = tt.reshape %187 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9530978Z %189 = arith.sitofp %188 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9531097Z %190 = ttg.local_alloc %189 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9531263Z %191 = ttg.local_load %190 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9531523Z %192 = tt.dot %164, %191, %152, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9531571Z %193 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:48:43.9531616Z %194 = arith.muli %193, %c2_i32 : i32 2026-02-21T09:48:43.9531707Z %195 = tt.splat %194 : i32 -> tensor<8xi32, #ttg.slice<{dim = 
0, parent = #blocked1}>> 2026-02-21T09:48:43.9531799Z %196 = arith.addi %195, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:48:43.9531962Z %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:48:43.9532054Z %198 = tt.broadcast %197 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9532119Z %199 = arith.addi %41, %198 : tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9532234Z %200 = tt.addptr %4, %199 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:48:43.9532298Z %201 = tt.load %200 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:48:43.9532422Z %202 = ttg.local_alloc %201 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:48:43.9532591Z %203 = ttg.local_load %202 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9532789Z %204 = arith.extf %203 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9532833Z %205 = arith.extsi %193 : i32 to i64 2026-02-21T09:48:43.9532964Z %206 = tt.splat %205 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9533091Z %207 = arith.addi %206, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:48:43.9533312Z %208 = tt.expand_dims %207 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9533409Z %209 = arith.muli %208, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9533577Z %210 = tt.broadcast %209 : tensor<4x1xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9533693Z %211 = arith.addi %210, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9533865Z %212 = tt.addptr %5, %211 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9533983Z %213 = arith.cmpi sge, %208, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534087Z %214 = arith.cmpi slt, %208, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534178Z %215 = arith.andi %213, %214 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534340Z %216 = tt.broadcast %215 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534432Z %217 = arith.andi %216, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534541Z %218 = tt.load %212, %217, %cst_1 : tensor<4x32x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534637Z %219 = arith.shli %218, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534737Z %220 = arith.shrsi %219, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534831Z %221 = arith.shrsi %218, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:48:43.9534978Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9535126Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked> 2026-02-21T09:48:43.9535219Z %224 = tt.broadcast %222 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 
2026-02-21T09:48:43.9535336Z %225 = arith.select %14, %224, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9535429Z %226 = tt.broadcast %223 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9535526Z %227 = arith.select %16, %226, %225 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked> 2026-02-21T09:48:43.9535629Z %228 = tt.reshape %227 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2> 2026-02-21T09:48:43.9535722Z %229 = arith.sitofp %228 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2> 2026-02-21T09:48:43.9535839Z %230 = ttg.local_alloc %229 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem> 2026-02-21T09:48:43.9536003Z %231 = ttg.local_load %230 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:48:43.9536267Z %232 = tt.dot %204, %231, %192, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9536319Z scf.yield %232 : tensor<128x32xf32, #mma> 2026-02-21T09:48:43.9536401Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:48:43.9536489Z %52 = arith.truncf %51 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma> 2026-02-21T09:48:43.9536532Z %53 = arith.extsi %35 : i32 to i64 2026-02-21T09:48:43.9536615Z %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9536698Z %55 = arith.addi %54, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:48:43.9536835Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9536896Z %57 = arith.muli %56, %cst_13 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9536991Z %58 = 
tt.broadcast %57 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9537071Z %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9537165Z %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:48:43.9537298Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9537376Z %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9537432Z %63 = arith.addi %58, %62 : tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9537523Z %64 = tt.addptr %17, %63 : tensor<128x32x!tt.ptr, #mma>, tensor<128x32xi64, #mma> 2026-02-21T09:48:43.9537587Z %65 = arith.cmpi sge, %56, %cst_14 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9537649Z %66 = arith.cmpi slt, %56, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:48:43.9537706Z %67 = arith.andi %65, %66 : tensor<128x1xi1, #mma> 2026-02-21T09:48:43.9537786Z %68 = tt.broadcast %67 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9537849Z %69 = arith.cmpi sge, %61, %cst_11 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9537910Z %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma> 2026-02-21T09:48:43.9537965Z %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma> 2026-02-21T09:48:43.9538041Z %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9538095Z %73 = arith.andi %68, %72 : tensor<128x32xi1, #mma> 2026-02-21T09:48:43.9538158Z tt.store %64, %52, %73 : tensor<128x32x!tt.ptr, #mma> 2026-02-21T09:48:43.9538221Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:48:43.9538257Z tt.return 2026-02-21T09:48:43.9538288Z } 2026-02-21T09:48:43.9538324Z } 2026-02-21T09:48:43.9538329Z 2026-02-21T09:48:43.9538359Z {-# 2026-02-21T09:48:43.9538401Z external_resources: { 2026-02-21T09:48:43.9538464Z mlir_reproducer: { 
2026-02-21T09:48:43.9539393Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:48:43.9539455Z disable_threading: false, 2026-02-21T09:48:43.9539492Z verify_each: true 2026-02-21T09:48:43.9539524Z } 2026-02-21T09:48:43.9539554Z } 2026-02-21T09:48:43.9539586Z #-} 2026-02-21T09:48:43.9539829Z /tmp/torchinductor_root/hq/chqcwabtsnwtngk7ihgmoab5jfdhtejdjrvnzsagkuazwdcogvdr.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:48:43.9540254Z /tmp/torchinductor_root/hq/chqcwabtsnwtngk7ihgmoab5jfdhtejdjrvnzsagkuazwdcogvdr.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:48:43.9540372Z [254s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:48:43.9541018Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 32], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[4, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:48:43.9541076Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:48:43.9541159Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:48:52.3663457Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 111/111 9.1 configs/s 2026-02-21T09:48:54.2141902Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 155/155 80.1 configs/s 2026-02-21T09:48:57.8750449Z [268s] Generation 1 complete: 2026-02-21T09:48:57.8750818Z error=15 2026-02-21T09:48:57.8751028Z ok=99 2026-02-21T09:48:57.8751229Z min=1.2871 2026-02-21T09:48:57.8751435Z mid=2.7963 2026-02-21T09:48:57.8751635Z max=72.1609 2026-02-21T09:48:57.8751873Z best={'block_sizes': [8, 128, 512], 2026-02-21T09:48:57.8752251Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T09:48:57.8752616Z 'l2_groupings': [1], 2026-02-21T09:48:57.8752896Z 'load_eviction_policies': ['', ''], 2026-02-21T09:48:57.8753212Z 'loop_orders': [[0, 1]], 2026-02-21T09:48:57.8753524Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:48:57.8753821Z 'num_sm_multiplier': 128, 2026-02-21T09:48:57.8754092Z 'num_stages': 4, 2026-02-21T09:48:57.8754321Z 'num_warps': 8, 2026-02-21T09:48:57.8754604Z 'pid_type': 'persistent_blocked', 2026-02-21T09:48:57.8754918Z 'range_flattens': [False, True], 2026-02-21T09:48:57.8755239Z 'range_multi_buffers': [True, True], 2026-02-21T09:48:57.8755545Z 'range_num_stages': [2, 1], 2026-02-21T09:48:57.8755826Z 'range_unroll_factors': [3, 3], 
2026-02-21T09:48:57.8756154Z 'range_warp_specializes': [], 2026-02-21T09:48:57.8756373Z 'waves_per_eu': 1} 2026-02-21T09:48:57.8801599Z [268s] Fitting surrogate: 214 points, 214 targets 2026-02-21T09:48:58.9930334Z [269s] Generation 2 starting: 103 neighbors, 5 active search path(s) 2026-02-21T09:49:34.9814951Z [305s] Timeout after 30s compiling Config(block_sizes=[32, 32, 512], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:49:34.9838909Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 103/103 0.7 configs/s 2026-02-21T09:49:35.8807396Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:49:35.8881397Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}> 2026-02-21T09:49:35.8889535Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 8], order = [2, 1, 0]}> 2026-02-21T09:49:35.8893392Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:49:35.8894984Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:49:35.8895579Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:35.8895923Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:35.8896218Z #smem = #ttg.shared_memory 2026-02-21T09:49:35.8896564Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:49:35.8897254Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:49:35.8898059Z %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.8898696Z %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.8898955Z %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.8899211Z %cst_2 = arith.constant dense<8192> : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.8899595Z %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.8899857Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8900116Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8900371Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 
2026-02-21T09:49:35.8900633Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:35.8900865Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:35.8901050Z %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.8901206Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:49:35.8901373Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x512xf32, #mma> 2026-02-21T09:49:35.8901612Z %cst_11 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8901876Z %cst_12 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8902071Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:49:35.8902192Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:49:35.8902321Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:49:35.8902440Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:49:35.8902563Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:49:35.8902692Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:49:35.8902811Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:49:35.8902961Z %cst_13 = arith.constant dense<0> : tensor<2x512xi8, #blocked> 2026-02-21T09:49:35.8903138Z %cst_14 = arith.constant dense<8192> : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.8903323Z %cst_15 = arith.constant dense<0> : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.8903593Z %cst_16 = arith.constant dense<0> : tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8903748Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:49:35.8903874Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:49:35.8904058Z %cst_17 = arith.constant dense<4> : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8904333Z %0 = tt.get_program_id x : i32 2026-02-21T09:49:35.8904449Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:49:35.8904571Z %2 = arith.minsi %1, %c2048_i32 : i32 2026-02-21T09:49:35.8904782Z %3 = 
tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.8905064Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.8905339Z %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8905583Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.8905788Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.8906025Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8906391Z %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8906851Z %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.8907317Z %11 = arith.extsi %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.8907875Z %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:49:35.8908511Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:49:35.8909122Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:35.8909513Z %15 = arith.cmpi eq, %14, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:35.8909802Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1> 2026-02-21T09:49:35.8910085Z 
%17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:35.8910366Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1> 2026-02-21T09:49:35.8910667Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:35.8911058Z %20 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.8911493Z %21 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.8911936Z %22 = arith.extsi %21 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.8912270Z %23 = arith.subi %2, %0 : i32 2026-02-21T09:49:35.8912428Z %24 = arith.remsi %23, %c3_i32 : i32 2026-02-21T09:49:35.8912596Z %25 = arith.subi %23, %24 : i32 2026-02-21T09:49:35.8912750Z %26 = arith.addi %0, %25 : i32 2026-02-21T09:49:35.8912985Z %27 = arith.addi %5, %cst_11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8913382Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.8913770Z %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8914132Z %30 = arith.addi %9, %cst_12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8914518Z %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8914898Z %32 = arith.muli %31, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8915164Z %33 = tt.broadcast %32 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8915438Z %34 = arith.cmpi sge, %31, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8915676Z %35 = arith.cmpi slt, %31, %cst_4 
: tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8915900Z %36 = arith.andi %34, %35 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.8916159Z %37 = tt.broadcast %36 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8916410Z scf.for %arg3 = %0 to %26 step %c3_i32 : i32 { 2026-02-21T09:49:35.8916612Z %38 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:49:35.8916796Z %39 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:49:35.8916969Z %40 = arith.muli %38, %c128_i32 : i32 2026-02-21T09:49:35.8917216Z %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.8917552Z %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.8917803Z %43 = arith.muli %39, %c512_i32 : i32 2026-02-21T09:49:35.8918125Z %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.8918492Z %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.8918773Z %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8919023Z %47 = arith.extsi %43 : i32 to i64 2026-02-21T09:49:35.8919290Z %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.8919604Z %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.8920020Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.8920420Z %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8920704Z %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.8920956Z %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.8921186Z %54 = arith.andi %52, %53 : tensor<1x512xi1, 
#blocked> 2026-02-21T09:49:35.8921446Z %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8921827Z %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:35.8922138Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:35.8922391Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8922766Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8923169Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.8923571Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8923903Z %243 = arith.addi %46, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8924212Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8924514Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.8924873Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.8925367Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8925981Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8926447Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:35.8926695Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8927016Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8927415Z %252 = tt.expand_dims 
%251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8927771Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8928050Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8928331Z %255 = arith.addi %254, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8928613Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8928917Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8929162Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8929400Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.8929675Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8929950Z %261 = arith.andi %260, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8930217Z %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.8930618Z %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8931037Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8931412Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8931774Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8932208Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.8932709Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.8933143Z %269 = tt.broadcast %267 : 
tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8933511Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8933869Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8934228Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8934570Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.8934895Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.8935259Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.8935759Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8936502Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.8937025Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:35.8937205Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:49:35.8937472Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8937793Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8938197Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.8938596Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8938883Z %284 = arith.addi %46, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8939178Z %285 
= tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8939477Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.8939805Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.8940291Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8940918Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8941332Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:49:35.8941571Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8941913Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8942307Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8942685Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8942965Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8943244Z %296 = arith.addi %295, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8943552Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8943856Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8944104Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8944344Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.8944610Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8944885Z %302 = arith.andi %301, %55 : tensor<2x512xi1, 
#blocked> 2026-02-21T09:49:35.8945124Z %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.8945505Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8945854Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8946101Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8946436Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8946868Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.8947401Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.8947857Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8948324Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8948718Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8949085Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8949428Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.8949757Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.8950121Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.8950601Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:35.8951295Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.8951818Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:35.8951999Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:49:35.8952243Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8952570Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.8952978Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.8953403Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8953687Z %325 = arith.addi %46, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8953996Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8954301Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.8954627Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.8955107Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8955705Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8956113Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:49:35.8956360Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.8956681Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:49:35.8957075Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8957433Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8957706Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8957985Z %337 = arith.addi %336, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8958268Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8958566Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8958834Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.8959068Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.8959338Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8959606Z %343 = arith.andi %342, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8959867Z %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.8960244Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8960659Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8961013Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8961367Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8961801Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.8962306Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 
2026-02-21T09:49:35.8962762Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8963121Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8963480Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8963829Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8964170Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.8964511Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.8964872Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.8965373Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8966061Z %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.8966573Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:49:35.8966759Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:35.8966959Z %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8967241Z %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.8967526Z %59 = tt.load %58 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.8967836Z %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.8968301Z %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent 
= #mma, kWidth = 2}>> 2026-02-21T09:49:35.8968885Z %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8969305Z %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8969571Z %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8969846Z %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.8970075Z %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.8970461Z %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8970863Z %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8971215Z %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8971553Z %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.8971970Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.8972462Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.8972872Z %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8973217Z %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8973561Z %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8973893Z %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.8974220Z %77 = tt.reshape %76 : 
tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.8974529Z %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.8974873Z %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.8975332Z %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.8989184Z %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.8989970Z %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:35.8990220Z %83 = arith.extsi %40 : i32 to i64 2026-02-21T09:49:35.8990445Z %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.8990738Z %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.8991109Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:35.8991443Z %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.8991698Z %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:35.8991995Z %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.8992289Z %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.8992670Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:35.8993068Z %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:35.8993328Z %93 = arith.addi 
%88, %92 : tensor<128x512xi64, #mma> 2026-02-21T09:49:35.8993592Z %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:35.8993874Z %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.8994098Z %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.8994314Z %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma> 2026-02-21T09:49:35.8994556Z %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.8994850Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.8995081Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.8995303Z %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma> 2026-02-21T09:49:35.8995573Z %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.8995828Z %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma> 2026-02-21T09:49:35.8996055Z tt.store %94, %82, %103 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:35.8996265Z %104 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:49:35.8996437Z %105 = arith.remsi %104, %c128_i32 : i32 2026-02-21T09:49:35.8996611Z %106 = arith.divsi %104, %c128_i32 : i32 2026-02-21T09:49:35.8996780Z %107 = arith.muli %105, %c128_i32 : i32 2026-02-21T09:49:35.8997026Z %108 = tt.splat %107 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.8997353Z %109 = arith.addi %108, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.8997621Z %110 = arith.muli %106, %c512_i32 : i32 2026-02-21T09:49:35.8997852Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.8998108Z %112 = arith.muli %111, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.8998307Z %113 = tt.broadcast %112 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, 
#blocked2> 2026-02-21T09:49:35.8998486Z %114 = arith.extsi %110 : i32 to i64 2026-02-21T09:49:35.8998660Z %115 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.8998883Z %116 = arith.addi %115, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.8999183Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.8999466Z %118 = tt.broadcast %117 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.8999684Z %119 = arith.cmpi sge, %117, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.8999864Z %120 = arith.cmpi slt, %117, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.9000032Z %121 = arith.andi %119, %120 : tensor<1x512xi1, #blocked> 2026-02-21T09:49:35.9000227Z %122 = tt.broadcast %121 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9000498Z %123 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:35.9000716Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:35.9000890Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9001114Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9001392Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9001673Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9001871Z %243 = arith.addi %113, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9002079Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9002286Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, 
#blocked2> 2026-02-21T09:49:35.9002511Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9002908Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9003344Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9003630Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:35.9003799Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9004037Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9004316Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9004562Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9004757Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9004951Z %255 = arith.addi %254, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9005150Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9005358Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9005528Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9005693Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9005879Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9006070Z %261 = arith.andi %260, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9006244Z %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9006507Z %263 = 
ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9006804Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9007063Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9007310Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9007628Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9007975Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9008273Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9008526Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9008778Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9009025Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9009266Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9009494Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9009753Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9010086Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9010572Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, 
kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9010930Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:35.9011076Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:49:35.9011247Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9011477Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9011768Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9012045Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9012244Z %284 = arith.addi %113, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9012446Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9012660Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9012889Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9013223Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9013635Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9013922Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:49:35.9014090Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9014314Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9014585Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9014845Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9015036Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9015231Z %296 = arith.addi %295, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9015443Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9015649Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9015820Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9015981Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9016172Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9016361Z %302 = arith.andi %301, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9016530Z %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9016794Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9017079Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9017324Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9017573Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9017871Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9018228Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9018524Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 
2026-02-21T09:49:35.9018812Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9019069Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9019318Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9019579Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9019808Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9020066Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9020400Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9020878Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9021237Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:35.9021369Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:49:35.9021543Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9021773Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9022050Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9022340Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9022543Z %325 = arith.addi %113, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9022774Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9022991Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9023233Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9023577Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9023990Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9024275Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:49:35.9024452Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9024675Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9024956Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9025210Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9025406Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9025606Z %337 = arith.addi %336, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9025804Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9026019Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9026190Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9026360Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9026553Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9026759Z %343 = arith.andi %342, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9026936Z %344 = tt.load 
%338, %343, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9027201Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9027511Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9027759Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9028007Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9028311Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9028658Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9028956Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9029212Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9029462Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9029709Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9029953Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9030183Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9030443Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9030792Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9031287Z %359 = tt.dot %330, %358, 
%318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9031648Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9031785Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:35.9031937Z %124 = arith.addi %113, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9032143Z %125 = tt.addptr %6, %124 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9032359Z %126 = tt.load %125 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9032588Z %127 = ttg.local_alloc %126 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9032926Z %128 = ttg.local_load %127 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9033342Z %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9033647Z %130 = arith.addi %33, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9033851Z %131 = tt.addptr %7, %130 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9034058Z %132 = arith.andi %37, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9034228Z %133 = tt.load %131, %132, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9034500Z %134 = ttg.convert_layout %133 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9034811Z %135 = arith.shli %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9035061Z %136 = arith.shrsi %135, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9035309Z %137 = arith.shrsi %134, %cst_17 : 
tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9035624Z %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9035974Z %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9036271Z %140 = tt.broadcast %138 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9036524Z %141 = arith.select %16, %140, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9036779Z %142 = tt.broadcast %139 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9037022Z %143 = arith.select %18, %142, %141 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9037264Z %144 = tt.reshape %143 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9037494Z %145 = arith.sitofp %144 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9037752Z %146 = ttg.local_alloc %145 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9038087Z %147 = ttg.local_load %146 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9038561Z %148 = tt.dot %129, %147, %123, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9038974Z %149 = arith.truncf %148 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:35.9039154Z %150 = arith.extsi %107 : i32 to i64 2026-02-21T09:49:35.9039345Z %151 = tt.splat %150 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:49:35.9039565Z %152 = arith.addi %151, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.9039836Z %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9040083Z %154 = arith.muli %153, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9040269Z %155 = tt.broadcast %154 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9040482Z %156 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.9040699Z %157 = arith.addi %156, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.9040966Z %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9041236Z %159 = tt.broadcast %158 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9041419Z %160 = arith.addi %155, %159 : tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9041619Z %161 = tt.addptr %19, %160 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9041827Z %162 = arith.cmpi sge, %153, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9041994Z %163 = arith.cmpi slt, %153, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9042156Z %164 = arith.andi %162, %163 : tensor<128x1xi1, #mma> 2026-02-21T09:49:35.9042334Z %165 = tt.broadcast %164 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9042527Z %166 = arith.cmpi sge, %158, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9042767Z %167 = arith.cmpi slt, %158, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9042933Z %168 = arith.andi %166, %167 : tensor<1x512xi1, #mma> 2026-02-21T09:49:35.9043114Z %169 = tt.broadcast %168 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9043298Z %170 = arith.andi %165, %169 : tensor<128x512xi1, #mma> 
2026-02-21T09:49:35.9043606Z tt.store %161, %149, %170 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:35.9043758Z %171 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:49:35.9043889Z %172 = arith.remsi %171, %c128_i32 : i32 2026-02-21T09:49:35.9044015Z %173 = arith.divsi %171, %c128_i32 : i32 2026-02-21T09:49:35.9044148Z %174 = arith.muli %172, %c128_i32 : i32 2026-02-21T09:49:35.9044329Z %175 = tt.splat %174 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.9044559Z %176 = arith.addi %175, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.9044743Z %177 = arith.muli %173, %c512_i32 : i32 2026-02-21T09:49:35.9044972Z %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.9045236Z %179 = arith.muli %178, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.9045440Z %180 = tt.broadcast %179 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9045620Z %181 = arith.extsi %177 : i32 to i64 2026-02-21T09:49:35.9045794Z %182 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.9046019Z %183 = arith.addi %182, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.9046306Z %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.9046610Z %185 = tt.broadcast %184 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9046818Z %186 = arith.cmpi sge, %184, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.9047003Z %187 = arith.cmpi slt, %184, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.9047196Z %188 = arith.andi %186, %187 : tensor<1x512xi1, #blocked> 2026-02-21T09:49:35.9047389Z %189 = tt.broadcast %188 : tensor<1x512xi1, #blocked> -> 
tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9047660Z %190 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:35.9047886Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:35.9048066Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9048293Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9048579Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9048862Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9049068Z %243 = arith.addi %180, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9049278Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9049530Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9049856Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9050341Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9050941Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9051382Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:35.9051627Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9051950Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9052379Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9052737Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9053013Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9053289Z %255 = arith.addi %254, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9053574Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9053871Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9054120Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9054356Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9054620Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9054895Z %261 = arith.andi %260, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9055135Z %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9055557Z %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9055985Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9056336Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9056719Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9057150Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9057676Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9058102Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 
2026-02-21T09:49:35.9058461Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9058823Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9059173Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9059522Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9059847Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9060207Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9060687Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9061382Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9061895Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:35.9062075Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:49:35.9062320Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9062667Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9063066Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9063493Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9063781Z %284 = arith.addi %180, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9064071Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9064375Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9064695Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9065182Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9065781Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9066206Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:49:35.9066464Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9066783Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9067184Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9067542Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9067815Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9068126Z %296 = arith.addi %295, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9068406Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9068725Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9068975Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9069211Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9069479Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9069748Z %302 = arith.andi %301, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9069994Z %303 = tt.load 
%297, %302, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9070371Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9070794Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9071148Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9071501Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9071937Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9072444Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9072865Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9073230Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9073607Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9073975Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9074321Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9074659Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9075025Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9075505Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9076206Z %318 = tt.dot %289, %317, 
%277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9076718Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:35.9076900Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:49:35.9077170Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9077492Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9077896Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9078304Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9078646Z %325 = arith.addi %180, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9078945Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9079266Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9079590Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9080099Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9080699Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9081114Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:49:35.9081360Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9081678Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9093267Z %334 = tt.expand_dims %333 
{axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9093639Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9093918Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9094196Z %337 = arith.addi %336, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9094485Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9094786Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9095032Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9095268Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9095531Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9095805Z %343 = arith.andi %342, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9096047Z %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9096494Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9096913Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9097288Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9097641Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9098070Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9098573Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9098998Z %351 = tt.broadcast %349 : 
tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9099357Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9099720Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9100069Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9100412Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9100735Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9101090Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9101567Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9102281Z %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9102837Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9103033Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:35.9103235Z %191 = arith.addi %180, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9103527Z %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9103825Z %193 = tt.load %192 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9104143Z %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9104627Z %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:35.9105219Z %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9105650Z %197 = arith.addi %33, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9105935Z %198 = tt.addptr %7, %197 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9106216Z %199 = arith.andi %37, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9106460Z %200 = tt.load %198, %199, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9106829Z %201 = ttg.convert_layout %200 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9107242Z %202 = arith.shli %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9107642Z %203 = arith.shrsi %202, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9107990Z %204 = arith.shrsi %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9108421Z %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9108935Z %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9109358Z %207 = tt.broadcast %205 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9109719Z %208 = arith.select %16, %207, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9110073Z %209 = tt.broadcast %206 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9110422Z %210 = arith.select %18, %209, %208 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9110759Z %211 = tt.reshape %210 : 
tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9111081Z %212 = arith.sitofp %211 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9111448Z %213 = ttg.local_alloc %212 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9111915Z %214 = ttg.local_load %213 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9112622Z %215 = tt.dot %196, %214, %190, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9113194Z %216 = arith.truncf %215 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:35.9113461Z %217 = arith.extsi %174 : i32 to i64 2026-02-21T09:49:35.9113701Z %218 = tt.splat %217 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.9114006Z %219 = arith.addi %218, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.9114409Z %220 = tt.expand_dims %219 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9114753Z %221 = arith.muli %220, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9115013Z %222 = tt.broadcast %221 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9115315Z %223 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.9115614Z %224 = arith.addi %223, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.9116001Z %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9116381Z %226 = tt.broadcast %225 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 
2026-02-21T09:49:35.9116643Z %227 = arith.addi %222, %226 : tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9116920Z %228 = tt.addptr %19, %227 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9117209Z %229 = arith.cmpi sge, %220, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9117449Z %230 = arith.cmpi slt, %220, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9117672Z %231 = arith.andi %229, %230 : tensor<128x1xi1, #mma> 2026-02-21T09:49:35.9117926Z %232 = tt.broadcast %231 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9118195Z %233 = arith.cmpi sge, %225, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9118429Z %234 = arith.cmpi slt, %225, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9118655Z %235 = arith.andi %233, %234 : tensor<1x512xi1, #mma> 2026-02-21T09:49:35.9118919Z %236 = tt.broadcast %235 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9119180Z %237 = arith.andi %232, %236 : tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9119408Z tt.store %228, %216, %237 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:35.9119620Z } {tt.num_stages = 1 : i32} 2026-02-21T09:49:35.9119811Z scf.for %arg3 = %26 to %2 step %c1_i32 : i32 { 2026-02-21T09:49:35.9120001Z %38 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:49:35.9120183Z %39 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:49:35.9120355Z %40 = arith.muli %38, %c128_i32 : i32 2026-02-21T09:49:35.9120596Z %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.9120911Z %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:35.9121160Z %43 = arith.muli %39, %c512_i32 : i32 2026-02-21T09:49:35.9121485Z %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:35.9121842Z %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, 
#blocked2> 2026-02-21T09:49:35.9122121Z %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9122368Z %47 = arith.extsi %43 : i32 to i64 2026-02-21T09:49:35.9122646Z %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.9122953Z %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:35.9123343Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.9123735Z %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9124034Z %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.9124284Z %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:35.9124513Z %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked> 2026-02-21T09:49:35.9124789Z %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9125169Z %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:35.9125476Z %104 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:35.9125726Z %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9126049Z %106 = arith.addi %105, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9126449Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9126855Z %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9127132Z %109 = arith.addi %46, %108 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9127425Z %110 = tt.addptr %6, %109 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9127720Z %111 = tt.load %110 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9128043Z %112 = ttg.local_alloc %111 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9128531Z %113 = ttg.local_load %112 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9129124Z %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9129537Z %115 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:35.9129796Z %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9130114Z %117 = arith.addi %116, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9130508Z %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9130882Z %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9131156Z %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9131429Z %121 = arith.addi %120, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9131711Z %122 = tt.addptr %7, %121 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9132010Z %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9132260Z %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9132521Z %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9132782Z %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9133054Z %127 = arith.andi %126, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9133297Z %128 = tt.load 
%122, %127, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9133672Z %129 = ttg.convert_layout %128 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9134089Z %130 = arith.shli %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9134437Z %131 = arith.shrsi %130, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9134791Z %132 = arith.shrsi %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9135247Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9135750Z %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9136198Z %135 = tt.broadcast %133 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9136558Z %136 = arith.select %16, %135, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9136917Z %137 = tt.broadcast %134 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9137267Z %138 = arith.select %18, %137, %136 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9137604Z %139 = tt.reshape %138 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9137928Z %140 = arith.sitofp %139 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9138284Z %141 = ttg.local_alloc %140 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9138758Z %142 = ttg.local_load %141 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9139451Z %143 = tt.dot %114, %142, 
%arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9139955Z %144 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:35.9140133Z %145 = arith.muli %144, %c2_i32 : i32 2026-02-21T09:49:35.9140374Z %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9140698Z %147 = arith.addi %146, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9141115Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9141513Z %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9141812Z %150 = arith.addi %46, %149 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9142098Z %151 = tt.addptr %6, %150 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9142397Z %152 = tt.load %151 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9142717Z %153 = ttg.local_alloc %152 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9143191Z %154 = ttg.local_load %153 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9143786Z %155 = arith.extf %154 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9144191Z %156 = arith.extsi %144 : i32 to i64 2026-02-21T09:49:35.9144436Z %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9144755Z %158 = arith.addi %157, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9145145Z %159 = tt.expand_dims %158 
{axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9145520Z %160 = arith.muli %159, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9145805Z %161 = tt.broadcast %160 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9146091Z %162 = arith.addi %161, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9146397Z %163 = tt.addptr %7, %162 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9146691Z %164 = arith.cmpi sge, %159, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9146958Z %165 = arith.cmpi slt, %159, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9147189Z %166 = arith.andi %164, %165 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9147457Z %167 = tt.broadcast %166 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9147722Z %168 = arith.andi %167, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9147966Z %169 = tt.load %163, %168, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9148340Z %170 = ttg.convert_layout %169 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9148750Z %171 = arith.shli %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9149105Z %172 = arith.shrsi %171, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9149455Z %173 = arith.shrsi %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9149888Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9150392Z %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9150810Z %176 = tt.broadcast %174 : 
tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9151170Z %177 = arith.select %16, %176, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9151529Z %178 = tt.broadcast %175 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9151892Z %179 = arith.select %18, %178, %177 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9152233Z %180 = tt.reshape %179 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9152553Z %181 = arith.sitofp %180 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9152929Z %182 = ttg.local_alloc %181 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9153407Z %183 = ttg.local_load %182 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9154089Z %184 = tt.dot %155, %183, %143, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9154592Z %185 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:35.9154767Z %186 = arith.muli %185, %c2_i32 : i32 2026-02-21T09:49:35.9155014Z %187 = tt.splat %186 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9155333Z %188 = arith.addi %187, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:35.9155748Z %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:35.9156159Z %190 = tt.broadcast %189 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9156439Z %191 = arith.addi %46, %190 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9156721Z %192 = 
tt.addptr %6, %191 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9157017Z %193 = tt.load %192 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9157355Z %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9157828Z %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9158435Z %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9158838Z %197 = arith.extsi %185 : i32 to i64 2026-02-21T09:49:35.9159075Z %198 = tt.splat %197 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9159388Z %199 = arith.addi %198, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:35.9159778Z %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9160126Z %201 = arith.muli %200, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9160397Z %202 = tt.broadcast %201 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9160671Z %203 = arith.addi %202, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9160947Z %204 = tt.addptr %7, %203 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9161239Z %205 = arith.cmpi sge, %200, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9161481Z %206 = arith.cmpi slt, %200, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:35.9161709Z %207 = arith.andi %205, %206 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:35.9161972Z %208 = tt.broadcast %207 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9162234Z %209 = arith.andi %208, %55 : tensor<2x512xi1, 
#blocked> 2026-02-21T09:49:35.9162470Z %210 = tt.load %204, %209, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9162891Z %211 = ttg.convert_layout %210 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9163299Z %212 = arith.shli %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9163648Z %213 = arith.shrsi %212, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9164012Z %214 = arith.shrsi %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9164438Z %215 = tt.expand_dims %213 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9164946Z %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9165382Z %217 = tt.broadcast %215 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9165742Z %218 = arith.select %16, %217, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9166100Z %219 = tt.broadcast %216 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9166447Z %220 = arith.select %18, %219, %218 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9166788Z %221 = tt.reshape %220 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9167102Z %222 = arith.sitofp %221 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9167454Z %223 = ttg.local_alloc %222 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9167922Z %224 = ttg.local_load %223 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:35.9168625Z %225 = tt.dot %196, %224, %184, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9169131Z scf.yield %225 : tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9169336Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:35.9169533Z %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9169809Z %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:35.9170089Z %59 = tt.load %58 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:35.9170396Z %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:35.9170861Z %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9171441Z %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9171860Z %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9172125Z %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:35.9172399Z %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:35.9172623Z %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:35.9172989Z %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9173402Z %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9173734Z %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9174073Z %70 = 
arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:35.9174512Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9175000Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:35.9175433Z %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9175775Z %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9176113Z %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9176440Z %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:35.9176769Z %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:35.9177077Z %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:35.9177418Z %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:35.9177875Z %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:35.9178540Z %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:35.9179088Z %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:35.9179330Z %83 = arith.extsi %40 : i32 to i64 2026-02-21T09:49:35.9179553Z %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:49:35.9179862Z %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:35.9180232Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9180587Z %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9180839Z %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9181123Z %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.9181417Z %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:35.9181783Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9182152Z %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9182404Z %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9182669Z %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:35.9182945Z %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9183168Z %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:35.9183384Z %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma> 2026-02-21T09:49:35.9183617Z %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9183872Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9184102Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:35.9184320Z %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma> 2026-02-21T09:49:35.9184562Z %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9184813Z %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma> 2026-02-21T09:49:35.9185080Z tt.store %94, %82, 
%103 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:35.9185284Z } {tt.num_stages = 1 : i32} 2026-02-21T09:49:35.9185423Z tt.return 2026-02-21T09:49:35.9185537Z } 2026-02-21T09:49:35.9185642Z } 2026-02-21T09:49:35.9185702Z 2026-02-21T09:49:35.9185748Z {-# 2026-02-21T09:49:35.9185859Z external_resources: { 2026-02-21T09:49:35.9186016Z mlir_reproducer: { 2026-02-21T09:49:35.9187448Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:49:35.9188893Z disable_threading: false, 2026-02-21T09:49:35.9189039Z verify_each: true 2026-02-21T09:49:35.9189167Z } 2026-02-21T09:49:35.9189271Z } 2026-02-21T09:49:35.9189373Z #-} 2026-02-21T09:49:35.9189769Z /tmp/torchinductor_root/lv/clvxvl5r5sksdzmg7eet3qcymrlwe633kmps7qxlh4cpuwd6dba6.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:49:35.9190780Z /tmp/torchinductor_root/lv/clvxvl5r5sksdzmg7eet3qcymrlwe633kmps7qxlh4cpuwd6dba6.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:49:35.9191578Z [306s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:49:35.9192733Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 512], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:49:35.9193784Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:49:35.9194022Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:49:36.6643709Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:49:36.6653643Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}> 2026-02-21T09:49:36.6654036Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 8], order = [2, 1, 0]}> 2026-02-21T09:49:36.6654421Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:49:36.6654721Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 8], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:49:36.6654996Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:36.6655238Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:36.6655421Z #smem = #ttg.shared_memory 2026-02-21T09:49:36.6655662Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:49:36.6656136Z tt.func public 
@_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:49:36.6657190Z %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6657374Z %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6657556Z %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6657733Z %cst_2 = arith.constant dense<8192> : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6658016Z %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6658188Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6658359Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6658534Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6658722Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:36.6658895Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:36.6659079Z %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6659236Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:49:36.6659397Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6659634Z %cst_11 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6659894Z %cst_12 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6660092Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:49:36.6660209Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:49:36.6660336Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:49:36.6660455Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:49:36.6660578Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:49:36.6660698Z %c128_i32 = arith.constant 128 : i32 
2026-02-21T09:49:36.6660821Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:49:36.6661041Z %cst_13 = arith.constant dense<0> : tensor<2x512xi8, #blocked> 2026-02-21T09:49:36.6661220Z %cst_14 = arith.constant dense<8192> : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6661407Z %cst_15 = arith.constant dense<0> : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6661675Z %cst_16 = arith.constant dense<0> : tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6661833Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:49:36.6661946Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:49:36.6662135Z %cst_17 = arith.constant dense<4> : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6662336Z %0 = tt.get_program_id x : i32 2026-02-21T09:49:36.6662451Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:49:36.6662573Z %2 = arith.minsi %1, %c2048_i32 : i32 2026-02-21T09:49:36.6662782Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6663126Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6663398Z %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6663647Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6663847Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6664087Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6664400Z %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6664721Z %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6665055Z %11 = arith.extsi %10 : 
tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6665436Z %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:49:36.6665862Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:49:36.6666287Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:36.6666544Z %15 = arith.cmpi eq, %14, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:36.6666753Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1> 2026-02-21T09:49:36.6666953Z %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:49:36.6667153Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1> 2026-02-21T09:49:36.6667376Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:36.6667649Z %20 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6667963Z %21 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6668271Z %22 = arith.extsi %21 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6668504Z %23 = arith.subi %2, %0 : i32 2026-02-21T09:49:36.6668621Z %24 = arith.remsi %23, %c3_i32 : i32 2026-02-21T09:49:36.6668761Z %25 = arith.subi %23, %24 : i32 2026-02-21T09:49:36.6668877Z %26 = arith.addi %0, %25 : i32 2026-02-21T09:49:36.6669044Z %27 = arith.addi %5, %cst_11 : tensor<4xi32, #ttg.slice<{dim = 0, 
parent = #blocked2}>> 2026-02-21T09:49:36.6669343Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6669623Z %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6669872Z %30 = arith.addi %9, %cst_12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6670148Z %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6670388Z %32 = arith.muli %31, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6670577Z %33 = tt.broadcast %32 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6670777Z %34 = arith.cmpi sge, %31, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6670948Z %35 = arith.cmpi slt, %31, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6671115Z %36 = arith.andi %34, %35 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6671293Z %37 = tt.broadcast %36 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6671479Z scf.for %arg3 = %0 to %26 step %c3_i32 : i32 { 2026-02-21T09:49:36.6671618Z %38 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:49:36.6671750Z %39 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:49:36.6671875Z %40 = arith.muli %38, %c128_i32 : i32 2026-02-21T09:49:36.6672054Z %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6672280Z %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6672453Z %43 = arith.muli %39, %c512_i32 : i32 2026-02-21T09:49:36.6672683Z %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6672935Z %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, #blocked2> 
2026-02-21T09:49:36.6673134Z %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6673331Z %47 = arith.extsi %43 : i32 to i64 2026-02-21T09:49:36.6673505Z %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6673729Z %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6674019Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6674301Z %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6674504Z %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6674682Z %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6674854Z %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked> 2026-02-21T09:49:36.6675037Z %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6675308Z %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:36.6675526Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6675708Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6675940Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6676223Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6676508Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6676707Z %243 = arith.addi %46, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6676931Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, 
#blocked2> 2026-02-21T09:49:36.6677145Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6677369Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6677730Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6678144Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6678430Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:36.6678603Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6678818Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6679099Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6679343Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6679536Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6679729Z %255 = arith.addi %254, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6679927Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6680140Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6680317Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6680483Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6680670Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6680861Z %261 = arith.andi %260, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6681050Z %262 = tt.load %256, %261, %cst_13 : 
tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6681315Z %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6681627Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6681868Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6682114Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6682415Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6682837Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6683139Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6683391Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6683638Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6683887Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6684124Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6684349Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6684608Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6684988Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6685469Z %277 = tt.dot %248, %276, %arg5, inputPrecision 
= tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6685851Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6685979Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:49:36.6686154Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6686376Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6686651Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6686931Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6687128Z %284 = arith.addi %46, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6687329Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6687537Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6687760Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6688094Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6688505Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6688787Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:49:36.6688976Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6689196Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6689468Z %293 = tt.expand_dims %292 {axis = 1 : i32} : 
tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6689738Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6689931Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6690120Z %296 = arith.addi %295, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6690318Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6690523Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6690694Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6690862Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6691055Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6691243Z %302 = arith.andi %301, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6691411Z %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6691672Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6691957Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6692198Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6692447Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6692764Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6693110Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6693423Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, 
#blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6693675Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6693921Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6694162Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6694397Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6694624Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6694871Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6695197Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6695670Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6696020Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:36.6696147Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:49:36.6696319Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6696545Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6696840Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6697119Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6697314Z %325 = arith.addi %46, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6697529Z %326 = tt.addptr %6, %325 
: tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6697742Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6697963Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6698298Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6698705Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6698984Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:49:36.6699154Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6699374Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6699648Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6699895Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6700085Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6700277Z %337 = arith.addi %336, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6700485Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6700694Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6700864Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6701047Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6701233Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6701417Z %343 = arith.andi %342, %55 : tensor<2x512xi1, #blocked> 
2026-02-21T09:49:36.6701585Z %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6701848Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6702133Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6702378Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6702621Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6702923Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6703274Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6703564Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6703821Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6704066Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6704312Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6704566Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6704788Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6705040Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6705381Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:36.6705854Z %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6706203Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6706336Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:36.6706484Z %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6706678Z %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6706882Z %59 = tt.load %58 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6707102Z %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6707429Z %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6707830Z %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6708127Z %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6708330Z %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6708525Z %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6708686Z %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6708958Z %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6709234Z %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6709469Z %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6709705Z %70 = 
arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6709999Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6710340Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6710752Z %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6710995Z %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6711236Z %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6711470Z %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6711700Z %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6711917Z %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6712159Z %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6712498Z %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6712956Z %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6713360Z %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:36.6713533Z %83 = arith.extsi %40 : i32 to i64 2026-02-21T09:49:36.6713694Z %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:49:36.6713904Z %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6714167Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6714406Z %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6714587Z %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6714800Z %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6715011Z %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6715273Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6715541Z %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6715727Z %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6715920Z %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6716125Z %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6716305Z %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6716466Z %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma> 2026-02-21T09:49:36.6716640Z %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6716859Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6717033Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6717190Z %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma> 2026-02-21T09:49:36.6717368Z %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6717550Z %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6717716Z tt.store %94, %82, 
%103 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:36.6717865Z %104 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:49:36.6717995Z %105 = arith.remsi %104, %c128_i32 : i32 2026-02-21T09:49:36.6718123Z %106 = arith.divsi %104, %c128_i32 : i32 2026-02-21T09:49:36.6718250Z %107 = arith.muli %105, %c128_i32 : i32 2026-02-21T09:49:36.6718427Z %108 = tt.splat %107 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6718658Z %109 = arith.addi %108, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6718841Z %110 = arith.muli %106, %c512_i32 : i32 2026-02-21T09:49:36.6719076Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6719335Z %112 = arith.muli %111, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6719537Z %113 = tt.broadcast %112 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6719718Z %114 = arith.extsi %110 : i32 to i64 2026-02-21T09:49:36.6719894Z %115 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6720141Z %116 = arith.addi %115, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6720423Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6720712Z %118 = tt.broadcast %117 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6720933Z %119 = arith.cmpi sge, %117, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6721116Z %120 = arith.cmpi slt, %117, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6721285Z %121 = arith.andi %119, %120 : tensor<1x512xi1, #blocked> 2026-02-21T09:49:36.6721479Z %122 = tt.broadcast %121 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6721757Z %123 = 
scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:36.6721978Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6722161Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6722387Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6722710Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6722997Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6723197Z %243 = arith.addi %113, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6723405Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6723617Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6723848Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6724210Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6724804Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6725097Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:36.6725271Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6725498Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6725776Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6726024Z %253 = arith.muli 
%252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6726228Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6726424Z %255 = arith.addi %254, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6726629Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6726844Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6727017Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6727185Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6727371Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6727565Z %261 = arith.andi %260, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6727737Z %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6728005Z %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6728326Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6728575Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6728828Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6729164Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6729517Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6729815Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6730069Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, 
#blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6730325Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6730575Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6730812Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6731043Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6731297Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6731630Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6732130Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6732484Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6732636Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:49:36.6732809Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6733039Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6733320Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6733602Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6733806Z %284 = arith.addi %113, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6734015Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6734236Z %286 = tt.load %285 : 
tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6734469Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6734804Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6735217Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6735500Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:49:36.6735676Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6735903Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6736198Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6736451Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6736645Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6736859Z %296 = arith.addi %295, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6737059Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6737272Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6737447Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6737615Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6737808Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6738007Z %302 = arith.andi %301, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6738181Z %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6738449Z 
%304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6738738Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6738989Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6739239Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6739545Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6739913Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6740208Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6740468Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6740739Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6740984Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6741225Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6741451Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6741708Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6742040Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6742515Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6742874Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:36.6743002Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:49:36.6743180Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6743408Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6743688Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6743975Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6744199Z %325 = arith.addi %113, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6744413Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6744630Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6744880Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6745222Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6745637Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6745926Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:49:36.6746103Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6746325Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6746605Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6746854Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6747053Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6747251Z %337 = arith.addi %336, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6747449Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6747663Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6747836Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6748023Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6748217Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6748420Z %343 = arith.andi %342, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6748594Z %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6748860Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6749155Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6749403Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6749650Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6749958Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6750307Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6750610Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 
2026-02-21T09:49:36.6750872Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6751123Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6751373Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6751611Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6751844Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6752141Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6752471Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6752968Z %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6753327Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6753462Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:36.6753612Z %124 = arith.addi %113, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6753816Z %125 = tt.addptr %6, %124 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6754029Z %126 = tt.load %125 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6754255Z %127 = ttg.local_alloc %126 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6754594Z %128 = ttg.local_load %127 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6755006Z %129 = arith.extf %128 : tensor<128x4xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6755307Z %130 = arith.addi %33, %118 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6755511Z %131 = tt.addptr %7, %130 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6755711Z %132 = arith.andi %37, %122 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6755886Z %133 = tt.load %131, %132, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6756175Z %134 = ttg.convert_layout %133 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6756462Z %135 = arith.shli %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6756729Z %136 = arith.shrsi %135, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6756977Z %137 = arith.shrsi %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6757282Z %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6757633Z %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6757927Z %140 = tt.broadcast %138 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6758183Z %141 = arith.select %16, %140, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6758438Z %142 = tt.broadcast %139 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6758684Z %143 = arith.select %18, %142, %141 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6758928Z %144 = tt.reshape %143 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 
2026-02-21T09:49:36.6759154Z %145 = arith.sitofp %144 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6759413Z %146 = ttg.local_alloc %145 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6759742Z %147 = ttg.local_load %146 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6760241Z %148 = tt.dot %129, %147, %123, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6760637Z %149 = arith.truncf %148 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:36.6760814Z %150 = arith.extsi %107 : i32 to i64 2026-02-21T09:49:36.6760999Z %151 = tt.splat %150 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6761217Z %152 = arith.addi %151, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6761485Z %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6761732Z %154 = arith.muli %153, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6761915Z %155 = tt.broadcast %154 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6762135Z %156 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6762353Z %157 = arith.addi %156, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6762654Z %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6762924Z %159 = tt.broadcast %158 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6763113Z %160 = arith.addi %155, %159 : 
tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6763320Z %161 = tt.addptr %19, %160 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6763535Z %162 = arith.cmpi sge, %153, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6763705Z %163 = arith.cmpi slt, %153, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6763868Z %164 = arith.andi %162, %163 : tensor<128x1xi1, #mma> 2026-02-21T09:49:36.6764073Z %165 = tt.broadcast %164 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6764269Z %166 = arith.cmpi sge, %158, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6764434Z %167 = arith.cmpi slt, %158, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6764618Z %168 = arith.andi %166, %167 : tensor<1x512xi1, #mma> 2026-02-21T09:49:36.6764807Z %169 = tt.broadcast %168 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6764996Z %170 = arith.andi %165, %169 : tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6765169Z tt.store %161, %149, %170 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:36.6765325Z %171 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:49:36.6765459Z %172 = arith.remsi %171, %c128_i32 : i32 2026-02-21T09:49:36.6765586Z %173 = arith.divsi %171, %c128_i32 : i32 2026-02-21T09:49:36.6765718Z %174 = arith.muli %172, %c128_i32 : i32 2026-02-21T09:49:36.6777311Z %175 = tt.splat %174 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6777566Z %176 = arith.addi %175, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6777751Z %177 = arith.muli %173, %c512_i32 : i32 2026-02-21T09:49:36.6777994Z %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6778262Z %179 = arith.muli %178, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6778470Z %180 = tt.broadcast %179 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, 
#blocked2> 2026-02-21T09:49:36.6778656Z %181 = arith.extsi %177 : i32 to i64 2026-02-21T09:49:36.6778830Z %182 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6779060Z %183 = arith.addi %182, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6779417Z %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6779704Z %185 = tt.broadcast %184 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6779918Z %186 = arith.cmpi sge, %184, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6780121Z %187 = arith.cmpi slt, %184, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6780295Z %188 = arith.andi %186, %187 : tensor<1x512xi1, #blocked> 2026-02-21T09:49:36.6780485Z %189 = tt.broadcast %188 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6780760Z %190 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:36.6780985Z %238 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6781161Z %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6781394Z %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6781673Z %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6781958Z %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6782163Z %243 = arith.addi %180, %242 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6782371Z %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6782588Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, 
#blocked2> 2026-02-21T09:49:36.6782817Z %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6783163Z %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6783610Z %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6783917Z %249 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:36.6784095Z %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6784318Z %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6784598Z %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6784853Z %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6785048Z %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6785250Z %255 = arith.addi %254, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6785451Z %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6785664Z %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6785841Z %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6786009Z %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6786201Z %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6786393Z %261 = arith.andi %260, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6786572Z %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6786836Z %263 = 
ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6787132Z %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6787403Z %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6787652Z %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6787960Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6788329Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6788626Z %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6788886Z %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6789139Z %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6789393Z %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6789635Z %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6789861Z %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6790119Z %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6790448Z %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6790939Z %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, 
kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6791325Z %278 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6791454Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:49:36.6791631Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6791875Z %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6792158Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6792441Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6792641Z %284 = arith.addi %180, %283 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6792850Z %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6793058Z %286 = tt.load %285 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6793292Z %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6793630Z %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6794040Z %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6794331Z %290 = arith.extsi %278 : i32 to i64 2026-02-21T09:49:36.6794503Z %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6794730Z %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6795008Z %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6795256Z %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6795480Z %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6795675Z %296 = arith.addi %295, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6795879Z %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6796106Z %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6796279Z %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6796449Z %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6796638Z %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6796835Z %302 = arith.andi %301, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6797008Z %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6797279Z %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6797570Z %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6797814Z %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6798066Z %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6798371Z %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6798720Z %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6799019Z %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 
2026-02-21T09:49:36.6799295Z %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6799548Z %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6799815Z %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6800051Z %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6800279Z %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6800533Z %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6800866Z %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6801345Z %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6801697Z %319 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:36.6801830Z %320 = arith.muli %319, %c2_i32 : i32 2026-02-21T09:49:36.6802008Z %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6802240Z %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6802525Z %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6802870Z %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6803075Z %325 = arith.addi %180, %324 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6803305Z %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6803520Z %327 = tt.load %326 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6803755Z %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6804107Z %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6804519Z %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6804805Z %331 = arith.extsi %319 : i32 to i64 2026-02-21T09:49:36.6804979Z %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6805205Z %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6805485Z %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6805736Z %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6805929Z %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6806127Z %337 = arith.addi %336, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6806328Z %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6806535Z %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6806713Z %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6806877Z %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6807090Z %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6807287Z %343 = arith.andi %342, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6807459Z %344 = tt.load 
%338, %343, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6807745Z %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6808034Z %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6808283Z %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6808533Z %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6808831Z %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6809186Z %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6809478Z %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6809735Z %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6809988Z %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6810234Z %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6810478Z %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6810702Z %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6810957Z %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6811309Z %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6811784Z %359 = tt.dot %330, %358, 
%318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6812158Z scf.yield %359 : tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6812294Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:36.6812443Z %191 = arith.addi %180, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6812649Z %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6812853Z %193 = tt.load %192 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6813082Z %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6813416Z %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6813827Z %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6814137Z %197 = arith.addi %33, %185 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6814336Z %198 = tt.addptr %7, %197 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6814541Z %199 = arith.andi %37, %189 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6814710Z %200 = tt.load %198, %199, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6814977Z %201 = ttg.convert_layout %200 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6815286Z %202 = arith.shli %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6815530Z %203 = arith.shrsi %202, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6815793Z %204 = arith.shrsi %201, %cst_17 : 
tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6816095Z %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6816445Z %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6816742Z %207 = tt.broadcast %205 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6816990Z %208 = arith.select %16, %207, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6817242Z %209 = tt.broadcast %206 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6817486Z %210 = arith.select %18, %209, %208 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6817724Z %211 = tt.reshape %210 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6817952Z %212 = arith.sitofp %211 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6818202Z %213 = ttg.local_alloc %212 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6818534Z %214 = ttg.local_load %213 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6819009Z %215 = tt.dot %196, %214, %190, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6819420Z %216 = arith.truncf %215 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:36.6819602Z %217 = arith.extsi %174 : i32 to i64 2026-02-21T09:49:36.6819769Z %218 = tt.splat %217 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:49:36.6820004Z %219 = arith.addi %218, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6820282Z %220 = tt.expand_dims %219 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6820525Z %221 = arith.muli %220, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6820711Z %222 = tt.broadcast %221 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6820922Z %223 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6821139Z %224 = arith.addi %223, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6821409Z %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6821674Z %226 = tt.broadcast %225 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6821863Z %227 = arith.addi %222, %226 : tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6822059Z %228 = tt.addptr %19, %227 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6822267Z %229 = arith.cmpi sge, %220, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6822434Z %230 = arith.cmpi slt, %220, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6822599Z %231 = arith.andi %229, %230 : tensor<128x1xi1, #mma> 2026-02-21T09:49:36.6822784Z %232 = tt.broadcast %231 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6822972Z %233 = arith.cmpi sge, %225, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6823159Z %234 = arith.cmpi slt, %225, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6823318Z %235 = arith.andi %233, %234 : tensor<1x512xi1, #mma> 2026-02-21T09:49:36.6823511Z %236 = tt.broadcast %235 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6823695Z %237 = arith.andi %232, %236 : tensor<128x512xi1, #mma> 
2026-02-21T09:49:36.6823858Z tt.store %228, %216, %237 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:36.6824031Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:49:36.6824182Z scf.for %arg3 = %26 to %2 step %c1_i32 : i32 { 2026-02-21T09:49:36.6824320Z %38 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:49:36.6824447Z %39 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:49:36.6824575Z %40 = arith.muli %38, %c128_i32 : i32 2026-02-21T09:49:36.6824752Z %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6824978Z %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:36.6825157Z %43 = arith.muli %39, %c512_i32 : i32 2026-02-21T09:49:36.6825307Z %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6825372Z %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:49:36.6825471Z %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6825516Z %47 = arith.extsi %43 : i32 to i64 2026-02-21T09:49:36.6825605Z %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6825698Z %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:49:36.6825841Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6825950Z %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6826024Z %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6826096Z %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked> 2026-02-21T09:49:36.6826170Z %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked> 2026-02-21T09:49:36.6826257Z %55 = 
tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6826398Z %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>) : i32 { 2026-02-21T09:49:36.6826444Z %104 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6826540Z %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6826637Z %106 = arith.addi %105, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6826785Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6826879Z %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6826946Z %109 = arith.addi %46, %108 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6827052Z %110 = tt.addptr %6, %109 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6827118Z %111 = tt.load %110 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6827246Z %112 = ttg.local_alloc %111 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6827420Z %113 = ttg.local_load %112 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6827642Z %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6827694Z %115 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:36.6827805Z %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6827895Z %117 = arith.addi %116, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6828042Z %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6828107Z %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6828197Z %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6828262Z %121 = arith.addi %120, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6828362Z %122 = tt.addptr %7, %121 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6828433Z %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6828504Z %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6828568Z %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6828658Z %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6828722Z %127 = arith.andi %126, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6828796Z %128 = tt.load %122, %127, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6828944Z %129 = ttg.convert_layout %128 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6829047Z %130 = arith.shli %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6829154Z %131 = arith.shrsi %130, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6829273Z %132 = arith.shrsi %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6829429Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6829605Z %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6829706Z %135 = tt.broadcast %133 : tensor<2x1x512xi8, #blocked1> -> 
tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6829819Z %136 = arith.select %16, %135, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6829921Z %137 = tt.broadcast %134 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6830027Z %138 = arith.select %18, %137, %136 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6830119Z %139 = tt.reshape %138 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6830216Z %140 = arith.sitofp %139 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6830336Z %141 = ttg.local_alloc %140 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6830508Z %142 = ttg.local_load %141 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6830781Z %143 = tt.dot %114, %142, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6830830Z %144 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:49:36.6830876Z %145 = arith.muli %144, %c2_i32 : i32 2026-02-21T09:49:36.6830986Z %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6831080Z %147 = arith.addi %146, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6831243Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6831341Z %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6831406Z %150 = arith.addi %46, %149 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6831510Z %151 = tt.addptr %6, %150 : 
tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6831580Z %152 = tt.load %151 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6831704Z %153 = ttg.local_alloc %152 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6831881Z %154 = ttg.local_load %153 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6832084Z %155 = arith.extf %154 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6832130Z %156 = arith.extsi %144 : i32 to i64 2026-02-21T09:49:36.6832223Z %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6832316Z %158 = arith.addi %157, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6832459Z %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6832523Z %160 = arith.muli %159, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6832620Z %161 = tt.broadcast %160 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6832701Z %162 = arith.addi %161, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6832802Z %163 = tt.addptr %7, %162 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6832877Z %164 = arith.cmpi sge, %159, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6832961Z %165 = arith.cmpi slt, %159, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6833022Z %166 = arith.andi %164, %165 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6833115Z %167 = tt.broadcast %166 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6833175Z %168 = arith.andi %167, %55 : tensor<2x512xi1, #blocked> 
2026-02-21T09:49:36.6833248Z %169 = tt.load %163, %168, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6833396Z %170 = ttg.convert_layout %169 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6833501Z %171 = arith.shli %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6833603Z %172 = arith.shrsi %171, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6833705Z %173 = arith.shrsi %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6833865Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6834016Z %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6834116Z %176 = tt.broadcast %174 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6834231Z %177 = arith.select %16, %176, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6834344Z %178 = tt.broadcast %175 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6834450Z %179 = arith.select %18, %178, %177 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6834560Z %180 = tt.reshape %179 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6834653Z %181 = arith.sitofp %180 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6834772Z %182 = ttg.local_alloc %181 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6834946Z %183 = ttg.local_load %182 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:36.6835210Z %184 = tt.dot %155, %183, %143, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6835260Z %185 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:36.6835309Z %186 = arith.muli %185, %c2_i32 : i32 2026-02-21T09:49:36.6835401Z %187 = tt.splat %186 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6835495Z %188 = arith.addi %187, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:36.6835644Z %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:49:36.6835738Z %190 = tt.broadcast %189 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6835800Z %191 = arith.addi %46, %190 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6835908Z %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6835998Z %193 = tt.load %192 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6836122Z %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6836298Z %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6836516Z %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6836562Z %197 = arith.extsi %185 : i32 to i64 2026-02-21T09:49:36.6836657Z %198 = tt.splat %197 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:36.6836747Z %199 = arith.addi %198, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:49:36.6836892Z %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6836959Z %201 = arith.muli %200, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6837053Z %202 = tt.broadcast %201 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6837115Z %203 = arith.addi %202, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6837220Z %204 = tt.addptr %7, %203 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6837289Z %205 = arith.cmpi sge, %200, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6837356Z %206 = arith.cmpi slt, %200, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:49:36.6837418Z %207 = arith.andi %205, %206 : tensor<2x1xi1, #blocked> 2026-02-21T09:49:36.6837507Z %208 = tt.broadcast %207 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6837567Z %209 = arith.andi %208, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6837656Z %210 = tt.load %204, %209, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6837808Z %211 = ttg.convert_layout %210 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6837923Z %212 = arith.shli %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6838026Z %213 = arith.shrsi %212, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6838131Z %214 = arith.shrsi %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6838283Z %215 = tt.expand_dims %213 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6838435Z %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 
2026-02-21T09:49:36.6838540Z %217 = tt.broadcast %215 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6838651Z %218 = arith.select %16, %217, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6838749Z %219 = tt.broadcast %216 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6838858Z %220 = arith.select %18, %219, %218 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6838949Z %221 = tt.reshape %220 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6839041Z %222 = arith.sitofp %221 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6839163Z %223 = ttg.local_alloc %222 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6839335Z %224 = ttg.local_load %223 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6839616Z %225 = tt.dot %196, %224, %184, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6839693Z scf.yield %225 : tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6839741Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:49:36.6839804Z %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6839912Z %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:49:36.6839975Z %59 = tt.load %58 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:49:36.6840093Z %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:36.6840266Z %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent 
= #mma, kWidth = 2}>> 2026-02-21T09:49:36.6840462Z %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6840524Z %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6840628Z %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr, #blocked>, tensor<2x512xi64, #blocked> 2026-02-21T09:49:36.6840688Z %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked> 2026-02-21T09:49:36.6840759Z %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr, #blocked> 2026-02-21T09:49:36.6840910Z %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6841008Z %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6841292Z %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6841396Z %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:36.6841562Z %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6841711Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1> 2026-02-21T09:49:36.6841815Z %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6841921Z %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6842014Z %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6842119Z %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1> 2026-02-21T09:49:36.6842207Z %77 = tt.reshape %76 : 
tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked> 2026-02-21T09:49:36.6842296Z %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked> 2026-02-21T09:49:36.6842415Z %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem> 2026-02-21T09:49:36.6842623Z %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:36.6842884Z %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma> 2026-02-21T09:49:36.6842976Z %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma> 2026-02-21T09:49:36.6843021Z %83 = arith.extsi %40 : i32 to i64 2026-02-21T09:49:36.6843131Z %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6843223Z %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:36.6843362Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6843440Z %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6843528Z %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6843616Z %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6843699Z %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:36.6843834Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6843920Z %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6843977Z %93 = arith.addi 
%88, %92 : tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6844075Z %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr, #mma>, tensor<128x512xi64, #mma> 2026-02-21T09:49:36.6844144Z %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6844207Z %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:49:36.6844264Z %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma> 2026-02-21T09:49:36.6844350Z %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6844414Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6844481Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma> 2026-02-21T09:49:36.6844543Z %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma> 2026-02-21T09:49:36.6844642Z %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6844701Z %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma> 2026-02-21T09:49:36.6844792Z tt.store %94, %82, %103 : tensor<128x512x!tt.ptr, #mma> 2026-02-21T09:49:36.6844856Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:49:36.6844895Z tt.return 2026-02-21T09:49:36.6844929Z } 2026-02-21T09:49:36.6844973Z } 2026-02-21T09:49:36.6844977Z 2026-02-21T09:49:36.6845009Z {-# 2026-02-21T09:49:36.6845052Z external_resources: { 2026-02-21T09:49:36.6845096Z mlir_reproducer: { 2026-02-21T09:49:36.6846035Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, 
convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:49:36.6846081Z disable_threading: false, 2026-02-21T09:49:36.6846124Z verify_each: true 2026-02-21T09:49:36.6846158Z } 2026-02-21T09:49:36.6846190Z } 2026-02-21T09:49:36.6846226Z #-} 2026-02-21T09:49:36.6846472Z /tmp/torchinductor_root/wk/cwktnkmg5v3y2kafb7nz4m3aoac7nvgsfypwxtyj22sga5evnyda.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:49:36.6846898Z /tmp/torchinductor_root/wk/cwktnkmg5v3y2kafb7nz4m3aoac7nvgsfypwxtyj22sga5evnyda.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:49:36.6847037Z [307s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:49:36.6847678Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 512], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:49:36.6847749Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:49:36.6847835Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:49:46.6744086Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:49:46.6753083Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:49:46.6753866Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:49:46.6754506Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:49:46.6755139Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:49:46.6755723Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:49:46.6756257Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:49:46.6756637Z #smem = #ttg.shared_memory 2026-02-21T09:49:46.6757115Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:49:46.6758576Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:49:46.6759503Z %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6759884Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:46.6760244Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:46.6760619Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6760962Z %c37631_i32 = arith.constant 37631 : i32 2026-02-21T09:49:46.6761213Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:49:46.6761453Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:49:46.6761689Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:49:46.6762108Z %cst_3 = arith.constant dense<4> : tensor<4xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6762733Z %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6763167Z %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6763528Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6763832Z %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6764123Z %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6764408Z %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6764711Z %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6765014Z %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6765270Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:49:46.6765462Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:49:46.6765659Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:49:46.6765976Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:49:46.6766179Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:49:46.6766375Z %c14592_i32 = arith.constant 14592 : i32 2026-02-21T09:49:46.6766616Z %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2> 2026-02-21T09:49:46.6766950Z %c9728_i32 = arith.constant 9728 : i32 2026-02-21T09:49:46.6767191Z %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6767441Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:49:46.6767636Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:49:46.6767834Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T09:49:46.6768145Z %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6768460Z %0 = tt.get_program_id x : i32 2026-02-21T09:49:46.6768791Z %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, 
parent = #blocked1}>> 2026-02-21T09:49:46.6769253Z %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6769705Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6770154Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6770591Z %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6771001Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6771331Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6771721Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6772291Z %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6772903Z %10 = arith.extsi %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6773422Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:49:46.6773945Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:49:46.6774454Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:46.6774774Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:46.6775021Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, 
#blocked> 2026-02-21T09:49:46.6775271Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:46.6775508Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:49:46.6775767Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:46.6775970Z %19 = arith.subi %c37631_i32, %0 : i32 2026-02-21T09:49:46.6776117Z %20 = arith.divui %19, %c4864_i32 : i32 2026-02-21T09:49:46.6776263Z %21 = arith.remsi %20, %c3_i32 : i32 2026-02-21T09:49:46.6776404Z %22 = arith.subi %20, %21 : i32 2026-02-21T09:49:46.6776547Z %23 = arith.muli %22, %c4864_i32 : i32 2026-02-21T09:49:46.6776689Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:49:46.6776849Z scf.for %arg3 = %0 to %24 step %c14592_i32 : i32 { 2026-02-21T09:49:46.6777027Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:49:46.6777176Z %26 = arith.muli %25, %c2_i32 : i32 2026-02-21T09:49:46.6777348Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:49:46.6777491Z %28 = arith.minsi %27, %c2_i32 : i32 2026-02-21T09:49:46.6777641Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:49:46.6777787Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:49:46.6777927Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:49:46.6778085Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:49:46.6778232Z %33 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:49:46.6778433Z %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6778685Z %35 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6778888Z %36 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:49:46.6779092Z %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:46.6779359Z %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6779618Z %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:49:46.6779890Z %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6780229Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6780541Z %42 = arith.muli %41, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6780782Z %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6780995Z %44 = arith.extsi %33 : i32 to i64 2026-02-21T09:49:46.6781207Z %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6781478Z %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6781838Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6782182Z %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6782444Z %49 = arith.cmpi sge, %47, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6782657Z %50 = arith.cmpi slt, %47, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6782869Z %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:46.6783091Z %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6783357Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6783688Z %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6784030Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6784273Z %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6784478Z %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, 
#blocked1> 2026-02-21T09:49:46.6784686Z %58 = tt.load %57 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6784968Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6785422Z ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6785697Z %60 = arith.addi %5, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6785981Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6786256Z %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6786463Z %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6786663Z %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6786864Z %65 = tt.load %64 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6787156Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6787509Z ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6788030Z %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:49:46.6788456Z %306 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:46.6788588Z %307 = arith.muli %306, %c2_i32 : i32 2026-02-21T09:49:46.6788761Z %308 = tt.splat %307 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:49:46.6788994Z %309 = arith.addi %308, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6789274Z %310 = tt.expand_dims %309 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6789561Z %311 = tt.broadcast %310 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6789758Z %312 = arith.addi %43, %311 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6789958Z %313 = tt.addptr %6, %312 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6790166Z %314 = tt.load %313 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6790480Z %315 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6790936Z %316 = arith.extf %315 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6791223Z %317 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:46.6791397Z %318 = tt.splat %317 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6791624Z %319 = arith.addi %318, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6791902Z %320 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6792153Z %321 = arith.muli %320, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6792352Z %322 = tt.broadcast %321 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6792547Z %323 = arith.addi %322, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6792745Z %324 = tt.addptr %7, %323 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6792955Z %325 = arith.cmpi sge, %320, %cst_7 
: tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6793130Z %326 = arith.cmpi slt, %320, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6793293Z %327 = arith.andi %325, %326 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6793477Z %328 = tt.broadcast %327 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6793668Z %329 = arith.andi %328, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6793833Z %330 = tt.load %324, %329, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6794092Z %331 = ttg.convert_layout %330 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6794411Z %332 = arith.shli %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6794649Z %333 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6794885Z %334 = arith.shrsi %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6795187Z %335 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6795521Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6795803Z %337 = tt.broadcast %335 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6796040Z %338 = arith.select %15, %337, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6796277Z %339 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6796505Z %340 = arith.select %17, %339, %338 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6796735Z %341 = tt.reshape %340 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6796954Z %342 = arith.sitofp %341 : 
tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6797246Z %343 = ttg.convert_layout %342 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6797713Z %344 = tt.dot %316, %343, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6798060Z %345 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:46.6798204Z %346 = arith.cmpi slt, %345, %c2_i32 : i32 2026-02-21T09:49:46.6798341Z %347 = arith.select %346, %345, %c0_i32 : i32 2026-02-21T09:49:46.6798601Z %348 = ttg.memdesc_index %53[%347] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6798971Z ttg.local_store %314, %348 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6799357Z scf.yield %344, %347, %arg8, %348 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6799651Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:46.6799924Z %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6800340Z %69 = arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6800664Z %70 = arith.addi %9, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6800939Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6801178Z %72 = 
arith.muli %71, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6801366Z %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6801549Z %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6801738Z %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6801937Z %76 = arith.cmpi sge, %71, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6802102Z %77 = arith.cmpi slt, %71, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6802280Z %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6802455Z %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6802681Z %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6802859Z %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6803108Z %82 = ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6803385Z %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6803612Z %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6803842Z %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6804122Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6804445Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6804717Z %88 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6804944Z %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T09:49:46.6805172Z %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6805392Z %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6805610Z %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6805821Z %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6806121Z %94 = ttg.convert_layout %93 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6806572Z %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6807069Z %96 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6807479Z %97 = arith.extf %96 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6807800Z %98 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6808071Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6808315Z %100 = arith.muli %99, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6808508Z %101 = tt.broadcast %100 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6808696Z %102 = arith.addi %101, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6808893Z %103 = tt.addptr %7, %102 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6809095Z %104 = arith.cmpi sge, %99, 
%cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6809270Z %105 = arith.cmpi slt, %99, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6809435Z %106 = arith.andi %104, %105 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6809615Z %107 = tt.broadcast %106 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6809805Z %108 = arith.andi %107, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6809990Z %109 = tt.load %103, %108, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6810244Z %110 = ttg.convert_layout %109 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6810523Z %111 = arith.shli %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6810772Z %112 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6811009Z %113 = arith.shrsi %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6811290Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6811620Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6811899Z %116 = tt.broadcast %114 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6812134Z %117 = arith.select %15, %116, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6812371Z %118 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6812610Z %119 = arith.select %17, %118, %117 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6812846Z %120 = tt.reshape %119 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6813071Z %121 = arith.sitofp 
%120 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6813365Z %122 = ttg.convert_layout %121 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6813844Z %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6814224Z ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6814456Z %124 = arith.truncf %123 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:49:46.6814725Z %125 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6814965Z %126 = arith.muli %125, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6815198Z %127 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:46.6815457Z %128 = tt.broadcast %126 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6815664Z %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6815852Z %130 = arith.addi %128, %129 : tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6816050Z %131 = tt.addptr %18, %130 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6816253Z tt.store %131, %124 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:46.6816400Z %132 = arith.addi %arg3, %c4864_i32 : i32 2026-02-21T09:49:46.6816532Z %133 = arith.divsi %132, %c512_i32 : i32 2026-02-21T09:49:46.6816657Z %134 = arith.muli %133, %c2_i32 : i32 2026-02-21T09:49:46.6816783Z %135 = arith.subi %c128_i32, %134 : i32 2026-02-21T09:49:46.6816908Z %136 = arith.minsi %135, %c2_i32 : i32 2026-02-21T09:49:46.6817030Z %137 = arith.remsi %132, %c512_i32 : i32 
2026-02-21T09:49:46.6817154Z %138 = arith.remsi %137, %136 : i32 2026-02-21T09:49:46.6817271Z %139 = arith.addi %134, %138 : i32 2026-02-21T09:49:46.6817391Z %140 = arith.divsi %137, %136 : i32 2026-02-21T09:49:46.6817508Z %141 = arith.muli %139, %c64_i32 : i32 2026-02-21T09:49:46.6817682Z %142 = tt.splat %141 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6817916Z %143 = arith.addi %142, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6818087Z %144 = arith.muli %140, %c64_i32 : i32 2026-02-21T09:49:46.6818259Z %145 = tt.splat %144 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:46.6818491Z %146 = tt.splat %144 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6818710Z %147 = arith.addi %145, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:46.6818924Z %148 = arith.addi %146, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6819200Z %149 = tt.expand_dims %147 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6819461Z %150 = arith.muli %149, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6819660Z %151 = tt.broadcast %150 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6819841Z %152 = arith.extsi %141 : i32 to i64 2026-02-21T09:49:46.6820011Z %153 = tt.splat %152 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6820241Z %154 = arith.addi %153, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6820525Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6820809Z %156 = tt.broadcast %155 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6821017Z %157 = 
arith.cmpi sge, %155, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6821193Z %158 = arith.cmpi slt, %155, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6821367Z %159 = arith.andi %157, %158 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:46.6821579Z %160 = tt.broadcast %159 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6821794Z %161 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6822007Z %162 = arith.addi %151, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6822203Z %163 = tt.addptr %6, %162 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6822412Z %164 = tt.load %163 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6822696Z %165 = ttg.memdesc_index %161[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6823054Z ttg.local_store %164, %165 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6823295Z %166 = arith.addi %151, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6823494Z %167 = tt.addptr %6, %166 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6823703Z %168 = tt.load %167 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6823988Z %169 = ttg.memdesc_index %161[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6824347Z ttg.local_store %168, %169 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6824867Z %170:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %165, %arg8 = %169) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 
2x64x4>) : i32 { 2026-02-21T09:49:46.6825285Z %306 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:46.6825418Z %307 = arith.muli %306, %c2_i32 : i32 2026-02-21T09:49:46.6825616Z %308 = tt.splat %307 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6825843Z %309 = arith.addi %308, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6826128Z %310 = tt.expand_dims %309 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6826422Z %311 = tt.broadcast %310 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6826628Z %312 = arith.addi %151, %311 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6826834Z %313 = tt.addptr %6, %312 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6827042Z %314 = tt.load %313 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6827346Z %315 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6827782Z %316 = arith.extf %315 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6828073Z %317 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:46.6828253Z %318 = tt.splat %317 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6828478Z %319 = arith.addi %318, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6828762Z %320 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6829015Z %321 = arith.muli %320, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6829217Z %322 = tt.broadcast %321 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 
2026-02-21T09:49:46.6829418Z %323 = arith.addi %322, %156 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6829644Z %324 = tt.addptr %7, %323 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6829858Z %325 = arith.cmpi sge, %320, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6830050Z %326 = arith.cmpi slt, %320, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6830222Z %327 = arith.andi %325, %326 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6830413Z %328 = tt.broadcast %327 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6830605Z %329 = arith.andi %328, %160 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6830783Z %330 = tt.load %324, %329, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6831040Z %331 = ttg.convert_layout %330 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6831325Z %332 = arith.shli %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6831580Z %333 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6831825Z %334 = arith.shrsi %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6832124Z %335 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6832464Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6832752Z %337 = tt.broadcast %335 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6832994Z %338 = arith.select %15, %337, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6833235Z %339 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 
2026-02-21T09:49:46.6833495Z %340 = arith.select %17, %339, %338 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6833723Z %341 = tt.reshape %340 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6833948Z %342 = arith.sitofp %341 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6834265Z %343 = ttg.convert_layout %342 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6834733Z %344 = tt.dot %316, %343, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6835083Z %345 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:46.6835213Z %346 = arith.cmpi slt, %345, %c2_i32 : i32 2026-02-21T09:49:46.6835353Z %347 = arith.select %346, %345, %c0_i32 : i32 2026-02-21T09:49:46.6835631Z %348 = ttg.memdesc_index %161[%347] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6835985Z ttg.local_store %314, %348 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6836378Z scf.yield %344, %347, %arg8, %348 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6836680Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:46.6836961Z %171 = ttg.local_load %170#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6837392Z %172 = arith.extf %171 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:46.6837706Z %173 = arith.addi %73, %156 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6837910Z %174 = tt.addptr %7, %173 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6838129Z %175 = arith.andi %79, %160 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6838300Z %176 = tt.load %174, %175, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6838569Z %177 = ttg.convert_layout %176 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6838859Z %178 = arith.shli %177, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6839105Z %179 = arith.shrsi %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6839352Z %180 = arith.shrsi %177, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6839649Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6839992Z %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6840274Z %183 = tt.broadcast %181 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6840518Z %184 = arith.select %15, %183, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6840760Z %185 = tt.broadcast %182 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6840991Z %186 = arith.select %17, %185, %184 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6841224Z %187 = tt.reshape %186 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6841449Z %188 = arith.sitofp %187 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6841909Z %189 = ttg.convert_layout %188 
: tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6842375Z %190 = tt.dot %172, %189, %170#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6842914Z %191 = ttg.local_load %170#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6843346Z %192 = arith.extf %191 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6843646Z %193 = arith.addi %101, %156 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6843845Z %194 = tt.addptr %7, %193 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6844051Z %195 = arith.andi %107, %160 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6844221Z %196 = tt.load %194, %195, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6844484Z %197 = ttg.convert_layout %196 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6844767Z %198 = arith.shli %197, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6845003Z %199 = arith.shrsi %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6845246Z %200 = arith.shrsi %197, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6845531Z %201 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6845892Z %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 
2026-02-21T09:49:46.6846176Z %203 = tt.broadcast %201 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6846436Z %204 = arith.select %15, %203, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6846681Z %205 = tt.broadcast %202 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6846911Z %206 = arith.select %17, %205, %204 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6847142Z %207 = tt.reshape %206 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6847366Z %208 = arith.sitofp %207 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6847662Z %209 = ttg.convert_layout %208 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6848134Z %210 = tt.dot %192, %209, %190, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6848525Z ttg.local_dealloc %161 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6848744Z %211 = arith.truncf %210 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:49:46.6849017Z %212 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6849257Z %213 = arith.muli %212, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6849493Z %214 = tt.expand_dims %143 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:46.6849754Z %215 = tt.broadcast %213 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6849963Z %216 = tt.broadcast %214 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6850168Z %217 = arith.addi %215, %216 : 
tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6850356Z %218 = tt.addptr %18, %217 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6850556Z tt.store %218, %211 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:46.6850719Z %219 = arith.addi %arg3, %c9728_i32 : i32 2026-02-21T09:49:46.6850852Z %220 = arith.divsi %219, %c512_i32 : i32 2026-02-21T09:49:46.6850981Z %221 = arith.muli %220, %c2_i32 : i32 2026-02-21T09:49:46.6851107Z %222 = arith.subi %c128_i32, %221 : i32 2026-02-21T09:49:46.6851233Z %223 = arith.minsi %222, %c2_i32 : i32 2026-02-21T09:49:46.6851353Z %224 = arith.remsi %219, %c512_i32 : i32 2026-02-21T09:49:46.6851477Z %225 = arith.remsi %224, %223 : i32 2026-02-21T09:49:46.6851595Z %226 = arith.addi %221, %225 : i32 2026-02-21T09:49:46.6851716Z %227 = arith.divsi %224, %223 : i32 2026-02-21T09:49:46.6851835Z %228 = arith.muli %226, %c64_i32 : i32 2026-02-21T09:49:46.6852004Z %229 = tt.splat %228 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6852220Z %230 = arith.addi %229, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6852389Z %231 = arith.muli %227, %c64_i32 : i32 2026-02-21T09:49:46.6852564Z %232 = tt.splat %231 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:46.6852780Z %233 = tt.splat %231 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6853002Z %234 = arith.addi %232, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:46.6853216Z %235 = arith.addi %233, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6853501Z %236 = tt.expand_dims %234 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6853786Z %237 = arith.muli %236, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6853985Z %238 = tt.broadcast %237 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, 
#blocked1> 2026-02-21T09:49:46.6854195Z %239 = arith.extsi %228 : i32 to i64 2026-02-21T09:49:46.6854365Z %240 = tt.splat %239 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6854592Z %241 = arith.addi %240, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6854874Z %242 = tt.expand_dims %241 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6855153Z %243 = tt.broadcast %242 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6855363Z %244 = arith.cmpi sge, %242, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6855540Z %245 = arith.cmpi slt, %242, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6855718Z %246 = arith.andi %244, %245 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:46.6855910Z %247 = tt.broadcast %246 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6856129Z %248 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6856323Z %249 = arith.addi %238, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6856521Z %250 = tt.addptr %6, %249 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6856730Z %251 = tt.load %250 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6857012Z %252 = ttg.memdesc_index %248[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6857374Z ttg.local_store %251, %252 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6857619Z %253 = arith.addi %238, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6857841Z %254 = tt.addptr %6, %253 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6858050Z %255 = tt.load %254 : tensor<64x4x!tt.ptr, #blocked1> 
2026-02-21T09:49:46.6858331Z %256 = ttg.memdesc_index %248[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6858700Z ttg.local_store %255, %256 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6859217Z %257:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %252, %arg8 = %256) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:49:46.6859631Z %306 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:46.6859769Z %307 = arith.muli %306, %c2_i32 : i32 2026-02-21T09:49:46.6859950Z %308 = tt.splat %307 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6860182Z %309 = arith.addi %308, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6860470Z %310 = tt.expand_dims %309 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6860750Z %311 = tt.broadcast %310 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6860954Z %312 = arith.addi %238, %311 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6861162Z %313 = tt.addptr %6, %312 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6861369Z %314 = tt.load %313 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6861685Z %315 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6862118Z %316 = arith.extf %315 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6862420Z %317 = 
arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:46.6862599Z %318 = tt.splat %317 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6862823Z %319 = arith.addi %318, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6863103Z %320 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6863357Z %321 = arith.muli %320, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6863556Z %322 = tt.broadcast %321 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6863758Z %323 = arith.addi %322, %243 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6863959Z %324 = tt.addptr %7, %323 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6864176Z %325 = arith.cmpi sge, %320, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6864352Z %326 = arith.cmpi slt, %320, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6864525Z %327 = arith.andi %325, %326 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6864723Z %328 = tt.broadcast %327 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6864914Z %329 = arith.andi %328, %247 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6865094Z %330 = tt.load %324, %329, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6865359Z %331 = ttg.convert_layout %330 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6865672Z %332 = arith.shli %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6865913Z %333 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6866236Z %334 = arith.shrsi %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6866576Z %335 = tt.expand_dims %333 {axis = 1 : i32} : 
tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6875121Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6875423Z %337 = tt.broadcast %335 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6875669Z %338 = arith.select %15, %337, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6875922Z %339 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6876160Z %340 = arith.select %17, %339, %338 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6876393Z %341 = tt.reshape %340 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6876625Z %342 = arith.sitofp %341 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6876921Z %343 = ttg.convert_layout %342 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6877396Z %344 = tt.dot %316, %343, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6877750Z %345 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:46.6877929Z %346 = arith.cmpi slt, %345, %c2_i32 : i32 2026-02-21T09:49:46.6878074Z %347 = arith.select %346, %345, %c0_i32 : i32 2026-02-21T09:49:46.6878347Z %348 = ttg.memdesc_index %248[%347] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6878728Z ttg.local_store %314, %348 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6879127Z scf.yield %344, %347, %arg8, %348 : 
tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6879427Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:46.6879709Z %258 = ttg.local_load %257#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6880145Z %259 = arith.extf %258 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6880445Z %260 = arith.addi %73, %243 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6880652Z %261 = tt.addptr %7, %260 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6880855Z %262 = arith.andi %79, %247 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6881030Z %263 = tt.load %261, %262, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6881292Z %264 = ttg.convert_layout %263 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6881575Z %265 = arith.shli %264, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6881818Z %266 = arith.shrsi %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6882057Z %267 = arith.shrsi %264, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6882369Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6882761Z %269 = tt.expand_dims %267 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6883063Z %270 = tt.broadcast %268 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6883307Z %271 = arith.select 
%15, %270, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6883547Z %272 = tt.broadcast %269 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6883787Z %273 = arith.select %17, %272, %271 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6884026Z %274 = tt.reshape %273 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6884255Z %275 = arith.sitofp %274 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6884558Z %276 = ttg.convert_layout %275 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6885029Z %277 = tt.dot %259, %276, %257#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6885532Z %278 = ttg.local_load %257#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6885969Z %279 = arith.extf %278 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6886292Z %280 = arith.addi %101, %243 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6886503Z %281 = tt.addptr %7, %280 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6886708Z %282 = arith.andi %107, %247 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6886898Z %283 = tt.load %281, %282, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6887160Z %284 = ttg.convert_layout %283 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6887442Z %285 = arith.shli %284, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent 
= #blocked}>> 2026-02-21T09:49:46.6887683Z %286 = arith.shrsi %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6887928Z %287 = arith.shrsi %284, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6888218Z %288 = tt.expand_dims %286 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6888558Z %289 = tt.expand_dims %287 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6888841Z %290 = tt.broadcast %288 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6889086Z %291 = arith.select %15, %290, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6889326Z %292 = tt.broadcast %289 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6889559Z %293 = arith.select %17, %292, %291 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6889793Z %294 = tt.reshape %293 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6890013Z %295 = arith.sitofp %294 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6890331Z %296 = ttg.convert_layout %295 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6890795Z %297 = tt.dot %279, %296, %277, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6891197Z ttg.local_dealloc %248 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6891414Z %298 = arith.truncf %297 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:49:46.6891683Z %299 = tt.expand_dims %235 {axis = 1 : i32} : 
tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6891921Z %300 = arith.muli %299, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6892156Z %301 = tt.expand_dims %230 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:46.6892417Z %302 = tt.broadcast %300 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6892625Z %303 = tt.broadcast %301 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6892808Z %304 = arith.addi %302, %303 : tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6893004Z %305 = tt.addptr %18, %304 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6893208Z tt.store %305, %298 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:46.6893351Z } {tt.num_stages = 1 : i32} 2026-02-21T09:49:46.6893490Z scf.for %arg3 = %24 to %c32768_i32 step %c4864_i32 : i32 { 2026-02-21T09:49:46.6893640Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:49:46.6893768Z %26 = arith.muli %25, %c2_i32 : i32 2026-02-21T09:49:46.6893890Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:49:46.6894013Z %28 = arith.minsi %27, %c2_i32 : i32 2026-02-21T09:49:46.6894138Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:49:46.6894275Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:49:46.6894394Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:49:46.6894506Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:49:46.6894638Z %33 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:49:46.6894797Z %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6895008Z %35 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:46.6895176Z %36 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:49:46.6895343Z %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:46.6895558Z %38 = tt.splat %36 : i32 -> 
tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6895766Z %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:46.6895979Z %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:46.6896248Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6896503Z %42 = arith.muli %41, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:46.6896699Z %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6896872Z %44 = arith.extsi %33 : i32 to i64 2026-02-21T09:49:46.6897041Z %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6897261Z %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:46.6897537Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6897817Z %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6898037Z %49 = arith.cmpi sge, %47, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6898214Z %50 = arith.cmpi slt, %47, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:46.6898381Z %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:46.6898571Z %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6898805Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6899072Z %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6899347Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 
2026-02-21T09:49:46.6899535Z %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6899734Z %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6899942Z %58 = tt.load %57 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6900223Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6900581Z ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6900853Z %60 = arith.addi %5, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6901132Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6901406Z %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6901593Z %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6901790Z %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6902004Z %65 = tt.load %64 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6902284Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6902651Z ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6903165Z %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:49:46.6903582Z %132 = arith.addi %arg4, %c4_i32 : i32 
2026-02-21T09:49:46.6903711Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:49:46.6903889Z %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6904120Z %135 = arith.addi %134, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:46.6904400Z %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:46.6904689Z %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6904885Z %138 = arith.addi %43, %137 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6905096Z %139 = tt.addptr %6, %138 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:46.6905307Z %140 = tt.load %139 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:46.6905609Z %141 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6906074Z %142 = arith.extf %141 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6906355Z %143 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:46.6906534Z %144 = tt.splat %143 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6906775Z %145 = arith.addi %144, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6907053Z %146 = tt.expand_dims %145 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6907308Z %147 = arith.muli %146, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6907502Z %148 = tt.broadcast %147 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6907701Z %149 = arith.addi %148, %48 : tensor<2x64xi64, #blocked2> 
2026-02-21T09:49:46.6907906Z %150 = tt.addptr %7, %149 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6908118Z %151 = arith.cmpi sge, %146, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6908297Z %152 = arith.cmpi slt, %146, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6908465Z %153 = arith.andi %151, %152 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6908659Z %154 = tt.broadcast %153 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6908856Z %155 = arith.andi %154, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6909029Z %156 = tt.load %150, %155, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6909297Z %157 = ttg.convert_layout %156 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6909583Z %158 = arith.shli %157, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6909844Z %159 = arith.shrsi %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6910086Z %160 = arith.shrsi %157, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6910400Z %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6910743Z %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6911027Z %163 = tt.broadcast %161 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6911275Z %164 = arith.select %15, %163, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6911520Z %165 = tt.broadcast %162 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6911754Z %166 = arith.select %17, %165, %164 : tensor<2x2x64xi1, #blocked>, 
tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6911989Z %167 = tt.reshape %166 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6912211Z %168 = arith.sitofp %167 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6912512Z %169 = ttg.convert_layout %168 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6912986Z %170 = tt.dot %142, %169, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6913335Z %171 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:46.6913474Z %172 = arith.cmpi slt, %171, %c2_i32 : i32 2026-02-21T09:49:46.6913609Z %173 = arith.select %172, %171, %c0_i32 : i32 2026-02-21T09:49:46.6913896Z %174 = ttg.memdesc_index %53[%173] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6914253Z ttg.local_store %140, %174 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6914644Z scf.yield %170, %173, %arg8, %174 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:46.6914960Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:46.6915237Z %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6915659Z %69 = arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6915991Z %70 = arith.addi %9, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:49:46.6916268Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6916517Z %72 = arith.muli %71, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6916710Z %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6916901Z %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6917094Z %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6917294Z %76 = arith.cmpi sge, %71, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6917467Z %77 = arith.cmpi slt, %71, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6917626Z %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6917811Z %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6918017Z %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6918181Z %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6918452Z %82 = ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6918727Z %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6918961Z %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6919195Z %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6919475Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6919809Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6920083Z %88 = tt.broadcast %86 
: tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6920323Z %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6920558Z %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6920788Z %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6921003Z %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6921216Z %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6921498Z %94 = ttg.convert_layout %93 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6921966Z %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6922447Z %96 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6922913Z %97 = arith.extf %96 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6923232Z %98 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:46.6923509Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6923752Z %100 = arith.muli %99, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6923945Z %101 = tt.broadcast %100 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6924137Z %102 = arith.addi %101, %48 : tensor<2x64xi64, 
#blocked2> 2026-02-21T09:49:46.6924331Z %103 = tt.addptr %7, %102 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:46.6924539Z %104 = arith.cmpi sge, %99, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6924710Z %105 = arith.cmpi slt, %99, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:46.6924875Z %106 = arith.andi %104, %105 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:46.6925056Z %107 = tt.broadcast %106 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6925243Z %108 = arith.andi %107, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:46.6925414Z %109 = tt.load %103, %108, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:46.6925668Z %110 = ttg.convert_layout %109 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6925969Z %111 = arith.shli %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6926204Z %112 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6926456Z %113 = arith.shrsi %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:46.6926742Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6927069Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:46.6927347Z %116 = tt.broadcast %114 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6927581Z %117 = arith.select %15, %116, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6927821Z %118 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6928051Z %119 = arith.select %17, %118, %117 : tensor<2x2x64xi1, #blocked>, 
tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:46.6928277Z %120 = tt.reshape %119 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:49:46.6928498Z %121 = arith.sitofp %120 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:49:46.6928785Z %122 = ttg.convert_layout %121 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:46.6929237Z %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:46.6929615Z ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:46.6929843Z %124 = arith.truncf %123 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:49:46.6930104Z %125 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6930339Z %126 = arith.muli %125, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:46.6930578Z %127 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:46.6930836Z %128 = tt.broadcast %126 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6931033Z %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6931210Z %130 = arith.addi %128, %129 : tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6931394Z %131 = tt.addptr %18, %130 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:46.6931588Z tt.store %131, %124 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:46.6931727Z } {tt.num_stages = 1 : i32} 2026-02-21T09:49:46.6931830Z tt.return 2026-02-21T09:49:46.6931914Z } 2026-02-21T09:49:46.6931990Z } 2026-02-21T09:49:46.6932035Z 2026-02-21T09:49:46.6932066Z {-# 
2026-02-21T09:49:46.6932145Z external_resources: { 2026-02-21T09:49:46.6932247Z mlir_reproducer: { 2026-02-21T09:49:46.6933236Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:49:46.6934229Z disable_threading: false, 2026-02-21T09:49:46.6934349Z verify_each: true 2026-02-21T09:49:46.6934440Z } 2026-02-21T09:49:46.6934510Z } 2026-02-21T09:49:46.6934579Z #-} 2026-02-21T09:49:46.6934873Z /tmp/torchinductor_root/jc/cjc6bibmitqfnd7irt5tdsf256e6evk3ednxdyldf7plkxktmfxs.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:49:46.6935556Z /tmp/torchinductor_root/jc/cjc6bibmitqfnd7irt5tdsf256e6evk3ednxdyldf7plkxktmfxs.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:49:46.6936112Z [317s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:49:46.6936890Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[1, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:49:46.6937598Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:49:46.6937767Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:49:47.0202034Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:49:47.0211908Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:49:47.0212772Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:49:47.0213411Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:49:47.0213846Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:49:47.0214238Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:47.0214638Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:47.0214909Z #smem = #ttg.shared_memory 2026-02-21T09:49:47.0215264Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:49:47.0215983Z tt.func public 
@_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:49:47.0216557Z %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0216826Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:47.0217089Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:47.0217358Z %cst_2 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0217639Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0217893Z %c42495_i32 = arith.constant 42495 : i32 2026-02-21T09:49:47.0218081Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:49:47.0218254Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:49:47.0218430Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:49:47.0218709Z %cst_4 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0219099Z %cst_5 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0219522Z %cst_6 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0219849Z %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0220117Z %cst_8 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0220406Z %cst_9 = arith.constant dense<512> : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0220673Z %cst_10 = arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0220945Z %cst_11 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0221215Z %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2> 2026-02-21T09:49:47.0221441Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:49:47.0221615Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:49:47.0221793Z 
%c128_i32 = arith.constant 128 : i32 2026-02-21T09:49:47.0221975Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:49:47.0222153Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:49:47.0222326Z %c29184_i32 = arith.constant 29184 : i32 2026-02-21T09:49:47.0222514Z %c19456_i32 = arith.constant 19456 : i32 2026-02-21T09:49:47.0222732Z %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0222958Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:49:47.0223123Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:49:47.0223270Z %c9728_i32 = arith.constant 9728 : i32 2026-02-21T09:49:47.0223515Z %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0223751Z %0 = tt.get_program_id x : i32 2026-02-21T09:49:47.0223996Z %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0224340Z %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0224672Z %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0225036Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0225375Z %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0225681Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0225956Z %7 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0226240Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0226641Z %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:49:47.0227093Z %10 = arith.extsi %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0227549Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:49:47.0228073Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:49:47.0228582Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:47.0228912Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:47.0229158Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:49:47.0229403Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:47.0229639Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:49:47.0229913Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:47.0230121Z %19 = arith.subi %c42495_i32, %0 : i32 2026-02-21T09:49:47.0230273Z %20 = arith.divui %19, %c9728_i32 : i32 2026-02-21T09:49:47.0230443Z %21 = arith.remsi %20, %c3_i32 : i32 2026-02-21T09:49:47.0230588Z %22 = arith.subi %20, %21 : i32 2026-02-21T09:49:47.0230729Z %23 = arith.muli %22, %c9728_i32 : i32 2026-02-21T09:49:47.0230874Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:49:47.0231034Z scf.for %arg3 = %0 to %24 step %c29184_i32 : i32 { 2026-02-21T09:49:47.0231210Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:49:47.0231359Z %26 = arith.muli %25, %c2_i32 : i32 2026-02-21T09:49:47.0231509Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:49:47.0231652Z %28 = arith.minsi %27, %c2_i32 : i32 2026-02-21T09:49:47.0231803Z %29 
= arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:49:47.0231951Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:49:47.0232091Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:49:47.0232234Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:49:47.0232373Z %33 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:49:47.0232572Z %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0232832Z %35 = arith.addi %34, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0233037Z %36 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:49:47.0233247Z %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0233513Z %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0233780Z %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0234043Z %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0234401Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0234779Z %42 = arith.muli %41, %cst_2 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0235019Z %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0235218Z %44 = arith.extsi %33 : i32 to i64 2026-02-21T09:49:47.0235408Z %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0235634Z %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0235908Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0236191Z %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0236393Z %49 = arith.cmpi 
sge, %47, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0236568Z %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0236742Z %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:47.0236928Z %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0237146Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0237420Z %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0237690Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0237881Z %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0238077Z %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0238289Z %58 = tt.load %57 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0238589Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0238946Z ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0239238Z %60 = arith.addi %5, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0239524Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0239804Z %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0239993Z %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0240200Z %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0240405Z %65 = tt.load %64 : tensor<64x4x!tt.ptr, #blocked1> 
2026-02-21T09:49:47.0240687Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0241051Z ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0241576Z %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:49:47.0242003Z %312 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:47.0242136Z %313 = arith.muli %312, %c2_i32 : i32 2026-02-21T09:49:47.0242313Z %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0242552Z %315 = arith.addi %314, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0243054Z %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0243341Z %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0243554Z %318 = arith.addi %43, %317 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0243753Z %319 = tt.addptr %6, %318 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0243962Z %320 = tt.load %319 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0244259Z %321 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0244696Z %322 = arith.extf %321 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0244981Z %323 = 
arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:47.0245154Z %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0245378Z %325 = arith.addi %324, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0245652Z %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0245905Z %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0246099Z %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0246290Z %329 = arith.addi %328, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0246491Z %330 = tt.addptr %7, %329 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0246729Z %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0246910Z %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0247077Z %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0247285Z %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0247481Z %335 = arith.andi %334, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0247649Z %336 = tt.load %330, %335, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0247906Z %337 = ttg.convert_layout %336 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0248189Z %338 = arith.shli %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0248428Z %339 = arith.shrsi %338, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0248668Z %340 = arith.shrsi %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0248959Z %341 = tt.expand_dims %339 {axis = 1 : i32} : 
tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0249302Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0249585Z %343 = tt.broadcast %341 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0249823Z %344 = arith.select %15, %343, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0250069Z %345 = tt.broadcast %342 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0250301Z %346 = arith.select %17, %345, %344 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0250534Z %347 = tt.reshape %346 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0250775Z %348 = arith.sitofp %347 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0251021Z %349 = ttg.local_alloc %348 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0251347Z %350 = ttg.local_load %349 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0251830Z %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0252180Z %352 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:47.0252307Z %353 = arith.cmpi slt, %352, %c2_i32 : i32 2026-02-21T09:49:47.0252438Z %354 = arith.select %353, %352, %c0_i32 : i32 2026-02-21T09:49:47.0252699Z %355 = ttg.memdesc_index %53[%354] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0253048Z ttg.local_store %320, %355 : tensor<64x4xbf16, #blocked1> -> 
!ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0253439Z scf.yield %351, %354, %arg8, %355 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0253738Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:47.0254007Z %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0254425Z %69 = arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0254768Z %70 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0255040Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0255297Z %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0255480Z %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0255666Z %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0255853Z %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0256052Z %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0256216Z %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0256371Z %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0256549Z %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0256731Z %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0256891Z %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0257142Z %82 = 
ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0257415Z %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0257641Z %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0257866Z %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0258146Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0258497Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0258768Z %88 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0259001Z %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0259239Z %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0259458Z %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0259673Z %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0259883Z %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0260122Z %94 = ttg.local_alloc %93 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0260435Z %95 = ttg.local_load %94 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0260894Z %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, 
kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0261377Z %97 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0261790Z %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0262112Z %99 = arith.addi %9, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0262397Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0262645Z %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0262836Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0263037Z %103 = arith.addi %102, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0263232Z %104 = tt.addptr %7, %103 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0263439Z %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0263608Z %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0263771Z %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0263951Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0264136Z %109 = arith.andi %108, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0264302Z %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0264555Z %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0264837Z %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0265071Z %113 = arith.shrsi 
%112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0265303Z %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0265589Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0265918Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0266198Z %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0266450Z %118 = arith.select %15, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0266685Z %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0266913Z %120 = arith.select %17, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0267151Z %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0267369Z %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0267611Z %123 = ttg.local_alloc %122 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0267931Z %124 = ttg.local_load %123 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0268393Z %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0268765Z ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0268972Z %126 = arith.truncf %125 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 
2026-02-21T09:49:47.0269230Z %127 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0269462Z %128 = arith.muli %127, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0269688Z %129 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:47.0269936Z %130 = tt.broadcast %128 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0270150Z %131 = tt.broadcast %129 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0270324Z %132 = arith.addi %130, %131 : tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0270507Z %133 = tt.addptr %18, %132 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0270715Z tt.store %133, %126 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:47.0270858Z %134 = arith.addi %arg3, %c9728_i32 : i32 2026-02-21T09:49:47.0270984Z %135 = arith.divsi %134, %c512_i32 : i32 2026-02-21T09:49:47.0271103Z %136 = arith.muli %135, %c2_i32 : i32 2026-02-21T09:49:47.0271224Z %137 = arith.subi %c128_i32, %136 : i32 2026-02-21T09:49:47.0271340Z %138 = arith.minsi %137, %c2_i32 : i32 2026-02-21T09:49:47.0271457Z %139 = arith.remsi %134, %c512_i32 : i32 2026-02-21T09:49:47.0271573Z %140 = arith.remsi %139, %138 : i32 2026-02-21T09:49:47.0271684Z %141 = arith.addi %136, %140 : i32 2026-02-21T09:49:47.0271799Z %142 = arith.divsi %139, %138 : i32 2026-02-21T09:49:47.0271913Z %143 = arith.muli %141, %c64_i32 : i32 2026-02-21T09:49:47.0272073Z %144 = tt.splat %143 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0272286Z %145 = arith.addi %144, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0272451Z %146 = arith.muli %142, %c64_i32 : i32 2026-02-21T09:49:47.0272617Z %147 = tt.splat %146 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0272826Z %148 = 
tt.splat %146 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0273039Z %149 = arith.addi %147, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0273248Z %150 = arith.addi %148, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0273517Z %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0273783Z %152 = arith.muli %151, %cst_2 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0273974Z %153 = tt.broadcast %152 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0274149Z %154 = arith.extsi %143 : i32 to i64 2026-02-21T09:49:47.0274311Z %155 = tt.splat %154 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0274556Z %156 = arith.addi %155, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0274830Z %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0275109Z %158 = tt.broadcast %157 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0275310Z %159 = arith.cmpi sge, %157, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0275488Z %160 = arith.cmpi slt, %157, %cst_11 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0275657Z %161 = arith.andi %159, %160 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:47.0275845Z %162 = tt.broadcast %161 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0276060Z %163 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0276244Z %164 = arith.addi %153, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0276440Z %165 = tt.addptr %6, %164 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0276645Z %166 = 
tt.load %165 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0276925Z %167 = ttg.memdesc_index %163[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0277280Z ttg.local_store %166, %167 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0277544Z %168 = arith.addi %153, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0277738Z %169 = tt.addptr %6, %168 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0277958Z %170 = tt.load %169 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0278234Z %171 = ttg.memdesc_index %163[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0278586Z ttg.local_store %170, %171 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0279098Z %172:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %167, %arg8 = %171) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:49:47.0279508Z %312 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:47.0279635Z %313 = arith.muli %312, %c2_i32 : i32 2026-02-21T09:49:47.0279805Z %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0280028Z %315 = arith.addi %314, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0280304Z %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0280580Z %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0280772Z %318 = arith.addi 
%153, %317 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0280968Z %319 = tt.addptr %6, %318 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0281170Z %320 = tt.load %319 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0281481Z %321 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0281909Z %322 = arith.extf %321 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0282188Z %323 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:47.0282374Z %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0282633Z %325 = arith.addi %324, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0282909Z %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0283155Z %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0283346Z %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0283538Z %329 = arith.addi %328, %158 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0283733Z %330 = tt.addptr %7, %329 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0283941Z %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0284109Z %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0284274Z %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0284459Z %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0284647Z %335 = arith.andi %334, %162 : tensor<2x64xi1, #blocked2> 
2026-02-21T09:49:47.0284812Z %336 = tt.load %330, %335, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0285066Z %337 = ttg.convert_layout %336 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0285365Z %338 = arith.shli %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0285598Z %339 = arith.shrsi %338, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0285851Z %340 = arith.shrsi %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0286138Z %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0286469Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0286748Z %343 = tt.broadcast %341 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0286984Z %344 = arith.select %15, %343, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0287221Z %345 = tt.broadcast %342 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0287451Z %346 = arith.select %17, %345, %344 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0287689Z %347 = tt.reshape %346 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0287915Z %348 = arith.sitofp %347 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0288167Z %349 = ttg.local_alloc %348 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0288498Z %350 = ttg.local_load %349 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0288976Z %351 = tt.dot %322, 
%350, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0289325Z %352 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:47.0289485Z %353 = arith.cmpi slt, %352, %c2_i32 : i32 2026-02-21T09:49:47.0289620Z %354 = arith.select %353, %352, %c0_i32 : i32 2026-02-21T09:49:47.0289891Z %355 = ttg.memdesc_index %163[%354] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0290268Z ttg.local_store %320, %355 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0290656Z scf.yield %351, %354, %arg8, %355 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0290958Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:47.0291239Z %173 = ttg.local_load %172#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0291672Z %174 = arith.extf %173 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0291974Z %175 = arith.addi %73, %158 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0292177Z %176 = tt.addptr %7, %175 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0292380Z %177 = arith.andi %79, %162 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0292553Z %178 = tt.load %176, %177, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0292809Z %179 = ttg.convert_layout %178 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0293092Z %180 = arith.shli 
%179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0293344Z %181 = arith.shrsi %180, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0293585Z %182 = arith.shrsi %179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0293894Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0294230Z %184 = tt.expand_dims %182 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0294514Z %185 = tt.broadcast %183 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0294753Z %186 = arith.select %15, %185, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0294995Z %187 = tt.broadcast %184 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0295233Z %188 = arith.select %17, %187, %186 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0295462Z %189 = tt.reshape %188 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0295688Z %190 = arith.sitofp %189 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0295939Z %191 = ttg.local_alloc %190 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0296264Z %192 = ttg.local_load %191 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0296735Z %193 = tt.dot %174, %192, %172#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0297225Z %194 = ttg.local_load %172#3 : !ttg.memdesc<64x4xbf16, 
#shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0297674Z %195 = arith.extf %194 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0297977Z %196 = arith.addi %102, %158 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0298177Z %197 = tt.addptr %7, %196 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0298397Z %198 = arith.andi %108, %162 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0298569Z %199 = tt.load %197, %198, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0298831Z %200 = ttg.convert_layout %199 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0299117Z %201 = arith.shli %200, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0299356Z %202 = arith.shrsi %201, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0299600Z %203 = arith.shrsi %200, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0299892Z %204 = tt.expand_dims %202 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0300236Z %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0300526Z %206 = tt.broadcast %204 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0300764Z %207 = arith.select %15, %206, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0301006Z %208 = tt.broadcast %205 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0301238Z %209 = arith.select %17, %208, %207 : tensor<2x2x64xi1, #blocked>, 
tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0301491Z %210 = tt.reshape %209 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0301719Z %211 = arith.sitofp %210 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0301985Z %212 = ttg.local_alloc %211 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0302315Z %213 = ttg.local_load %212 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0302779Z %214 = tt.dot %195, %213, %193, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0303163Z ttg.local_dealloc %163 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0303378Z %215 = arith.truncf %214 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:49:47.0303642Z %216 = tt.expand_dims %150 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0303884Z %217 = arith.muli %216, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0304116Z %218 = tt.expand_dims %145 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:47.0304378Z %219 = tt.broadcast %217 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0304583Z %220 = tt.broadcast %218 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0304762Z %221 = arith.addi %219, %220 : tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0304957Z %222 = tt.addptr %18, %221 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0305157Z tt.store %222, %215 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:47.0305307Z %223 = arith.addi %arg3, %c19456_i32 : i32 
2026-02-21T09:49:47.0305442Z %224 = arith.divsi %223, %c512_i32 : i32 2026-02-21T09:49:47.0305590Z %225 = arith.muli %224, %c2_i32 : i32 2026-02-21T09:49:47.0305715Z %226 = arith.subi %c128_i32, %225 : i32 2026-02-21T09:49:47.0305837Z %227 = arith.minsi %226, %c2_i32 : i32 2026-02-21T09:49:47.0305961Z %228 = arith.remsi %223, %c512_i32 : i32 2026-02-21T09:49:47.0306105Z %229 = arith.remsi %228, %227 : i32 2026-02-21T09:49:47.0306228Z %230 = arith.addi %225, %229 : i32 2026-02-21T09:49:47.0306343Z %231 = arith.divsi %228, %227 : i32 2026-02-21T09:49:47.0306465Z %232 = arith.muli %230, %c64_i32 : i32 2026-02-21T09:49:47.0306631Z %233 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0306844Z %234 = arith.addi %233, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0307020Z %235 = arith.muli %231, %c64_i32 : i32 2026-02-21T09:49:47.0307191Z %236 = tt.splat %235 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0307413Z %237 = tt.splat %235 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0307633Z %238 = arith.addi %236, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0307856Z %239 = arith.addi %237, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0308133Z %240 = tt.expand_dims %238 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0308388Z %241 = arith.muli %240, %cst_2 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0308588Z %242 = tt.broadcast %241 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0308764Z %243 = arith.extsi %232 : i32 to i64 2026-02-21T09:49:47.0308937Z %244 = tt.splat %243 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0309182Z %245 = arith.addi %244, %10 : tensor<64xi64, #ttg.slice<{dim = 0, 
parent = #blocked2}>> 2026-02-21T09:49:47.0309460Z %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0309761Z %247 = tt.broadcast %246 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0309965Z %248 = arith.cmpi sge, %246, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0310147Z %249 = arith.cmpi slt, %246, %cst_11 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0310319Z %250 = arith.andi %248, %249 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:47.0310507Z %251 = tt.broadcast %250 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0310726Z %252 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0310911Z %253 = arith.addi %242, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0311120Z %254 = tt.addptr %6, %253 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0311330Z %255 = tt.load %254 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0311613Z %256 = ttg.memdesc_index %252[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0311982Z ttg.local_store %255, %256 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0312220Z %257 = arith.addi %242, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0312422Z %258 = tt.addptr %6, %257 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0312631Z %259 = tt.load %258 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0312909Z %260 = ttg.memdesc_index %252[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0313283Z ttg.local_store %259, %260 : tensor<64x4xbf16, #blocked1> -> 
!ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0313795Z %261:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %256, %arg8 = %260) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:49:47.0314230Z %312 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:47.0314362Z %313 = arith.muli %312, %c2_i32 : i32 2026-02-21T09:49:47.0314536Z %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0314766Z %315 = arith.addi %314, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0315045Z %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0315331Z %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0315533Z %318 = arith.addi %242, %317 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0315733Z %319 = tt.addptr %6, %318 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0315944Z %320 = tt.load %319 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0316242Z %321 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0316677Z %322 = arith.extf %321 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0316965Z %323 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:47.0317151Z %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0317383Z %325 = arith.addi %324, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:49:47.0317676Z %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0317930Z %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0318128Z %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0318323Z %329 = arith.addi %328, %247 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0318526Z %330 = tt.addptr %7, %329 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0318734Z %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0318913Z %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0319084Z %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0319274Z %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0319472Z %335 = arith.andi %334, %251 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0319641Z %336 = tt.load %330, %335, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0319907Z %337 = ttg.convert_layout %336 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0320194Z %338 = arith.shli %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0320433Z %339 = arith.shrsi %338, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0320682Z %340 = arith.shrsi %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0321001Z %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0321343Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 
2026-02-21T09:49:47.0321631Z %343 = tt.broadcast %341 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0321896Z %344 = arith.select %15, %343, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0322136Z %345 = tt.broadcast %342 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0322369Z %346 = arith.select %17, %345, %344 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0322642Z %347 = tt.reshape %346 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0322872Z %348 = arith.sitofp %347 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0323126Z %349 = ttg.local_alloc %348 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0323455Z %350 = ttg.local_load %349 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0323929Z %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0324287Z %352 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:47.0324422Z %353 = arith.cmpi slt, %352, %c2_i32 : i32 2026-02-21T09:49:47.0324557Z %354 = arith.select %353, %352, %c0_i32 : i32 2026-02-21T09:49:47.0324831Z %355 = ttg.memdesc_index %252[%354] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0325212Z ttg.local_store %320, %355 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0325609Z scf.yield %351, %354, %arg8, %355 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, 
mutable, 2x64x4> 2026-02-21T09:49:47.0325930Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:47.0326207Z %262 = ttg.local_load %261#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0326638Z %263 = arith.extf %262 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0326939Z %264 = arith.addi %73, %247 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0327139Z %265 = tt.addptr %7, %264 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0327348Z %266 = arith.andi %79, %251 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0327517Z %267 = tt.load %265, %266, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0327781Z %268 = ttg.convert_layout %267 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0328068Z %269 = arith.shli %268, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0328304Z %270 = arith.shrsi %269, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0328546Z %271 = arith.shrsi %268, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0328832Z %272 = tt.expand_dims %270 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0329175Z %273 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0329485Z %274 = tt.broadcast %272 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0329724Z %275 = arith.select %15, %274, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0329981Z %276 = tt.broadcast 
%273 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0330210Z %277 = arith.select %17, %276, %275 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0330444Z %278 = tt.reshape %277 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0330672Z %279 = arith.sitofp %278 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0330931Z %280 = ttg.local_alloc %279 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0331263Z %281 = ttg.local_load %280 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0331731Z %282 = tt.dot %263, %281, %261#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0332228Z %283 = ttg.local_load %261#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0332656Z %284 = arith.extf %283 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0332951Z %285 = arith.addi %102, %247 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0333154Z %286 = tt.addptr %7, %285 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0333374Z %287 = arith.andi %108, %251 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0333543Z %288 = tt.load %286, %287, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0333829Z %289 = ttg.convert_layout %288 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0334111Z %290 = arith.shli %289, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim 
= 1, parent = #blocked}>> 2026-02-21T09:49:47.0334348Z %291 = arith.shrsi %290, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0334584Z %292 = arith.shrsi %289, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0334869Z %293 = tt.expand_dims %291 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0335203Z %294 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0335483Z %295 = tt.broadcast %293 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0335721Z %296 = arith.select %15, %295, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0335958Z %297 = tt.broadcast %294 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0336191Z %298 = arith.select %17, %297, %296 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0336418Z %299 = tt.reshape %298 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0336637Z %300 = arith.sitofp %299 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0336888Z %301 = ttg.local_alloc %300 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0337213Z %302 = ttg.local_load %301 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0337692Z %303 = tt.dot %284, %302, %282, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0338075Z ttg.local_dealloc %252 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0338303Z 
%304 = arith.truncf %303 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:49:47.0338568Z %305 = tt.expand_dims %239 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0338804Z %306 = arith.muli %305, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0339035Z %307 = tt.expand_dims %234 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:47.0339295Z %308 = tt.broadcast %306 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0339496Z %309 = tt.broadcast %307 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0339674Z %310 = arith.addi %308, %309 : tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0339864Z %311 = tt.addptr %18, %310 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0340063Z tt.store %311, %304 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:47.0340207Z } {tt.num_stages = 1 : i32} 2026-02-21T09:49:47.0340340Z scf.for %arg3 = %24 to %c32768_i32 step %c9728_i32 : i32 { 2026-02-21T09:49:47.0340489Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:49:47.0340612Z %26 = arith.muli %25, %c2_i32 : i32 2026-02-21T09:49:47.0340733Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:49:47.0340853Z %28 = arith.minsi %27, %c2_i32 : i32 2026-02-21T09:49:47.0346666Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:49:47.0346796Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:49:47.0346963Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:49:47.0347074Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:49:47.0347187Z %33 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:49:47.0347367Z %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0347580Z %35 = arith.addi %34, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:47.0347745Z %36 = arith.muli %32, %c64_i32 : i32 
2026-02-21T09:49:47.0347911Z %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0348123Z %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0348333Z %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:47.0348544Z %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:47.0348835Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0349087Z %42 = arith.muli %41, %cst_2 : tensor<64x1xi32, #blocked1> 2026-02-21T09:49:47.0349279Z %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0349454Z %44 = arith.extsi %33 : i32 to i64 2026-02-21T09:49:47.0349618Z %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0349835Z %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:49:47.0350108Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0350379Z %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0350579Z %49 = arith.cmpi sge, %47, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0350769Z %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x64xi64, #blocked2> 2026-02-21T09:49:47.0350933Z %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2> 2026-02-21T09:49:47.0351113Z %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0351340Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0351607Z %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0351875Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0352063Z %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0352259Z %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0352465Z %58 = tt.load %57 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0352747Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0353094Z ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0353367Z %60 = arith.addi %5, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0353642Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0353913Z %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0354099Z %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0354290Z %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0354504Z %65 = tt.load %64 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0354780Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0355145Z ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0355656Z %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 
2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:49:47.0356073Z %134 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:49:47.0356200Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:49:47.0356372Z %136 = tt.splat %135 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0356599Z %137 = arith.addi %136, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:47.0356876Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:47.0357152Z %139 = tt.broadcast %138 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0357347Z %140 = arith.addi %43, %139 : tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0357547Z %141 = tt.addptr %6, %140 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:49:47.0357752Z %142 = tt.load %141 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:49:47.0358052Z %143 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0358499Z %144 = arith.extf %143 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0358782Z %145 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:49:47.0358957Z %146 = tt.splat %145 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0359182Z %147 = arith.addi %146, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0359472Z %148 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0359721Z %149 = arith.muli %148, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0359915Z %150 = tt.broadcast %149 : 
tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0360109Z %151 = arith.addi %150, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0360302Z %152 = tt.addptr %7, %151 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0360516Z %153 = arith.cmpi sge, %148, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0360689Z %154 = arith.cmpi slt, %148, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0360856Z %155 = arith.andi %153, %154 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0361044Z %156 = tt.broadcast %155 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0361239Z %157 = arith.andi %156, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0361409Z %158 = tt.load %152, %157, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0361666Z %159 = ttg.convert_layout %158 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0361949Z %160 = arith.shli %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0362187Z %161 = arith.shrsi %160, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0362443Z %162 = arith.shrsi %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0362777Z %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0363135Z %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0363419Z %165 = tt.broadcast %163 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0363659Z %166 = arith.select %15, %165, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0363898Z %167 = tt.broadcast %164 : tensor<2x1x64xi8, 
#blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0364134Z %168 = arith.select %17, %167, %166 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0364366Z %169 = tt.reshape %168 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0364589Z %170 = arith.sitofp %169 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0364840Z %171 = ttg.local_alloc %170 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0365165Z %172 = ttg.local_load %171 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0365639Z %173 = tt.dot %144, %172, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0365982Z %174 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:49:47.0366112Z %175 = arith.cmpi slt, %174, %c2_i32 : i32 2026-02-21T09:49:47.0366248Z %176 = arith.select %175, %174, %c0_i32 : i32 2026-02-21T09:49:47.0366537Z %177 = ttg.memdesc_index %53[%176] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0366891Z ttg.local_store %142, %177 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0367294Z scf.yield %173, %176, %arg8, %177 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:49:47.0367596Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:49:47.0367872Z %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0368296Z %69 = 
arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0368622Z %70 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:49:47.0368895Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0369143Z %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0369335Z %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0369521Z %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0369713Z %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0369913Z %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0370083Z %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0370243Z %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0370440Z %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0370625Z %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0370949Z %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0371198Z %82 = ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0371477Z %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0371706Z %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0371938Z %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0372215Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0372545Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0372819Z %88 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0373052Z %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0373283Z %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0373503Z %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0373722Z %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0373933Z %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0374173Z %94 = ttg.local_alloc %93 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0374511Z %95 = ttg.local_load %94 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0374970Z %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0375465Z %97 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0375884Z %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0376206Z %99 = arith.addi %9, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:49:47.0376484Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0376734Z %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0376925Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0377118Z %103 = arith.addi %102, %48 : tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0377313Z %104 = tt.addptr %7, %103 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:49:47.0377521Z %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0377690Z %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:49:47.0377856Z %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 2026-02-21T09:49:47.0378041Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0378228Z %109 = arith.andi %108, %52 : tensor<2x64xi1, #blocked2> 2026-02-21T09:49:47.0378411Z %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:49:47.0378665Z %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0378963Z %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0379205Z %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0379440Z %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:47.0379729Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:47.0380060Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 
2026-02-21T09:49:47.0380340Z %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0380579Z %118 = arith.select %15, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0380816Z %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0381049Z %120 = arith.select %17, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:47.0381276Z %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:47.0381496Z %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:47.0381746Z %123 = ttg.local_alloc %122 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:47.0382069Z %124 = ttg.local_load %123 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:47.0382550Z %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:49:47.0382930Z ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:49:47.0383154Z %126 = arith.truncf %125 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:49:47.0383418Z %127 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0383650Z %128 = arith.muli %127, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:49:47.0383875Z %129 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:47.0384130Z %130 = tt.broadcast %128 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0384333Z 
%131 = tt.broadcast %129 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0384514Z %132 = arith.addi %130, %131 : tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0384698Z %133 = tt.addptr %18, %132 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:49:47.0384891Z tt.store %133, %126 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:49:47.0385030Z } {tt.num_stages = 1 : i32} 2026-02-21T09:49:47.0385133Z tt.return 2026-02-21T09:49:47.0385214Z } 2026-02-21T09:49:47.0385285Z } 2026-02-21T09:49:47.0385328Z 2026-02-21T09:49:47.0385363Z {-# 2026-02-21T09:49:47.0385443Z external_resources: { 2026-02-21T09:49:47.0385544Z mlir_reproducer: { 2026-02-21T09:49:47.0386562Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:49:47.0387566Z disable_threading: false, 2026-02-21T09:49:47.0387671Z verify_each: true 2026-02-21T09:49:47.0387758Z } 2026-02-21T09:49:47.0387833Z } 2026-02-21T09:49:47.0387900Z #-} 2026-02-21T09:49:47.0388178Z /tmp/torchinductor_root/jq/cjqqzavugghhhp6334vkuw67nomwtxn6gnd35x5ktrwiogdnqiuk.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:49:47.0388863Z /tmp/torchinductor_root/jq/cjqqzavugghhhp6334vkuw67nomwtxn6gnd35x5ktrwiogdnqiuk.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with 
Triton project.` 2026-02-21T09:49:47.0389408Z [317s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:49:47.0390190Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:49:47.0390901Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:49:47.0391068Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:49:49.4794708Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:49:49.4797009Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:49:49.4797894Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:49:49.4798815Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:49:49.4799646Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:49:49.4800343Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:49.4800989Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:49.4801400Z #smem = #ttg.shared_memory 2026-02-21T09:49:49.4801888Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:49:49.4803021Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:49:49.4803840Z %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma> 2026-02-21T09:49:49.4804178Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:49:49.4804426Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:49:49.4804672Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:49:49.4804978Z %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:49.4805290Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:49:49.4805526Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:49:49.4805772Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:49:49.4806226Z %cst_1 = arith.constant dense<0> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4806775Z 
%cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4807377Z %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4807906Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4808436Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4808967Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4809429Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:49:49.4809876Z %cst_8 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4810316Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:49.4810697Z %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:49.4811058Z %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:49:49.4811360Z %0 = tt.get_program_id x : i32 2026-02-21T09:49:49.4811541Z %1 = arith.divsi %0, %c256_i32 : i32 2026-02-21T09:49:49.4811712Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:49:49.4811894Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:49:49.4812060Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:49:49.4812233Z %5 = arith.remsi %0, %c256_i32 : i32 2026-02-21T09:49:49.4812405Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:49:49.4812571Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:49:49.4812724Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:49:49.4812891Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:49:49.4813199Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:49.4813670Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:49:49.4814047Z %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:49.4814362Z %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:49.4814712Z %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:49.4815032Z %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:49.4815274Z %16 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:49:49.4815628Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:49.4816089Z %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:49.4816437Z %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:49.4816736Z %20 = arith.addi %19, %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:49.4817087Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:49.4817553Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:49:49.4817934Z %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1> 2026-02-21T09:49:49.4818222Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:49:49.4818548Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:49:49.4818795Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T09:49:49.4819081Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4819571Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent 
= #blocked}>}>> 2026-02-21T09:49:49.4820238Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:49.4820831Z %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:49.4821428Z %31 = arith.extsi %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:49.4822021Z %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:49.4822521Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4823006Z %34 = tt.broadcast %33 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4823366Z %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4823637Z %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4823899Z %37 = arith.andi %35, %36 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4824237Z %38 = tt.broadcast %37 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4824644Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:49:49.4825144Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:49:49.4825605Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:49.4825908Z %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:49.4826135Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:49:49.4826362Z %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:49.4826581Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:49:49.4826875Z %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst) -> (tensor<128x64xf32, #mma>) : i32 { 2026-02-21T09:49:49.4827124Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:49:49.4827324Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:49.4827574Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:49.4827890Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:49.4828202Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:49:49.4828425Z %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1> 2026-02-21T09:49:49.4828654Z %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:49:49.4828884Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:49:49.4829139Z %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:49.4829531Z %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:49.4830017Z %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:49.4830343Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:49:49.4830581Z %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:49.4830917Z %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:49.4831359Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4831763Z %71 = arith.muli %70, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4832095Z %72 = tt.broadcast %71 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4832392Z %73 = arith.addi %72, %34 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4832700Z %74 = tt.addptr %27, %73 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4833016Z %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4833255Z %76 = arith.cmpi slt, %70, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4833484Z %77 = arith.andi %75, %76 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4833799Z %78 = tt.broadcast %77 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4834092Z %79 = arith.andi %78, %38 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:49:49.4834336Z %80 = tt.load %74, %79, %cst_1 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4834592Z %81 = arith.shli %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4834826Z %82 = arith.shrsi %81, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4835055Z %83 = arith.shrsi %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:49.4835336Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:49.4835666Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:49.4835940Z %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:49.4836194Z %87 = arith.select %43, %86, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:49.4836425Z %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:49.4836647Z %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:49.4836872Z %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:49.4837083Z %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:49.4837330Z %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:49.4837685Z %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:49.4838156Z %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:49:49.4838527Z scf.yield %94 : tensor<128x64xf32, #mma> 2026-02-21T09:49:49.4838653Z } {tt.num_stages = 3 : i32} 2026-02-21T09:49:49.4838810Z %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:49:49.4839079Z %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:49:49.4839315Z %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma> 2026-02-21T09:49:49.4839546Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:49.4839800Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:49:49.4840002Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:49:49.4840179Z %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma> 2026-02-21T09:49:49.4840348Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:49:49.4840562Z %55 = tt.addptr %54, %53 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:49:49.4840753Z tt.store %55, %47 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:49:49.4840886Z tt.return 2026-02-21T09:49:49.4840968Z } 2026-02-21T09:49:49.4841048Z } 2026-02-21T09:49:49.4841093Z 2026-02-21T09:49:49.4841130Z {-# 2026-02-21T09:49:49.4841212Z external_resources: { 2026-02-21T09:49:49.4841317Z mlir_reproducer: { 2026-02-21T09:49:49.4842363Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:49:49.4843419Z disable_threading: false, 2026-02-21T09:49:49.4843530Z verify_each: true 2026-02-21T09:49:49.4843620Z } 2026-02-21T09:49:49.4843699Z } 2026-02-21T09:49:49.4843771Z #-} 2026-02-21T09:49:49.4844056Z /tmp/torchinductor_root/xc/cxc3puzgq6o6wvakcbzttp54fjlz6dp2bmya5kqyaks6j4rqzuvd.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:49:49.4844737Z /tmp/torchinductor_root/xc/cxc3puzgq6o6wvakcbzttp54fjlz6dp2bmya5kqyaks6j4rqzuvd.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:49:49.4845285Z [320s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:49:49.4846006Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:49:49.4846657Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:49:49.4846823Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:49:50.6785251Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:49:50.6786774Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:49:50.6787657Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:49:50.6788498Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:49:50.6789277Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:49:50.6790011Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:50.6790668Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:49:50.6791076Z #smem = #ttg.shared_memory 2026-02-21T09:49:50.6791552Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:49:50.6792510Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:49:50.6793294Z %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma> 2026-02-21T09:49:50.6793623Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:49:50.6793861Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:49:50.6794098Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:49:50.6794391Z %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:50.6794693Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:49:50.6794917Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:49:50.6795151Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:49:50.6795604Z %cst_1 = arith.constant dense<0> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6796126Z 
%cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6796646Z %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6797196Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6797703Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6798204Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6798654Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:49:50.6799089Z %cst_8 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6799515Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:50.6799881Z %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:50.6800231Z %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:49:50.6800530Z %0 = tt.get_program_id x : i32 2026-02-21T09:49:50.6800757Z %1 = arith.divsi %0, %c256_i32 : i32 2026-02-21T09:49:50.6800956Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:49:50.6801134Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:49:50.6801295Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:49:50.6801491Z %5 = arith.remsi %0, %c256_i32 : i32 2026-02-21T09:49:50.6801655Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:49:50.6801815Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:49:50.6801975Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:49:50.6802142Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:49:50.6802465Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:50.6802969Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:49:50.6803360Z %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:50.6803669Z %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:50.6803980Z %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:49:50.6804293Z %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:49:50.6804522Z %16 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:49:50.6804874Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:50.6805326Z %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:50.6805655Z %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:50.6805955Z %20 = arith.addi %19, %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:49:50.6806305Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:50.6806755Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:49:50.6807124Z %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1> 2026-02-21T09:49:50.6807397Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:49:50.6807717Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:49:50.6807948Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T09:49:50.6808257Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6808702Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent 
= #blocked}>}>> 2026-02-21T09:49:50.6809323Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:50.6809937Z %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:50.6810512Z %31 = arith.extsi %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:50.6811092Z %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:50.6811612Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6812086Z %34 = tt.broadcast %33 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6812432Z %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6812704Z %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6812956Z %37 = arith.andi %35, %36 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6813281Z %38 = tt.broadcast %37 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6813699Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:49:50.6814157Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = 
#ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:49:50.6814623Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:50.6814907Z %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:50.6815128Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:49:50.6815346Z %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:49:50.6815555Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:49:50.6815850Z %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst) -> (tensor<128x64xf32, #mma>) : i32 { 2026-02-21T09:49:50.6816090Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:49:50.6816278Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:50.6816519Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:49:50.6816822Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:49:50.6817125Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:49:50.6817337Z %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1> 2026-02-21T09:49:50.6817557Z %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:49:50.6817784Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:49:50.6818049Z %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:49:50.6818416Z %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:49:50.6818872Z %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:50.6819209Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:49:50.6819439Z %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:50.6819771Z %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:49:50.6820202Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6820601Z %71 = arith.muli %70, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6820944Z %72 = tt.broadcast %71 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6821269Z %73 = arith.addi %72, %34 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6821570Z %74 = tt.addptr %27, %73 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6821883Z %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6822119Z %76 = arith.cmpi slt, %70, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6822344Z %77 = arith.andi %75, %76 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6822650Z %78 = tt.broadcast %77 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6822955Z %79 = arith.andi %78, %38 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:49:50.6823189Z %80 = tt.load %74, %79, %cst_1 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6823425Z %81 = arith.shli %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6823654Z %82 = arith.shrsi %81, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6823886Z %83 = arith.shrsi %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:49:50.6824163Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:50.6824494Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:49:50.6824770Z %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:50.6824997Z %87 = arith.select %43, %86, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:50.6825228Z %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:50.6825448Z %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:49:50.6825668Z %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:49:50.6825879Z %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:49:50.6826122Z %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:49:50.6826473Z %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:49:50.6826936Z %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:49:50.6827304Z scf.yield %94 : tensor<128x64xf32, #mma> 2026-02-21T09:49:50.6827453Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32} 2026-02-21T09:49:50.6827635Z %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:49:50.6827897Z %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:49:50.6828132Z %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma> 2026-02-21T09:49:50.6828361Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:49:50.6828616Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:49:50.6828812Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:49:50.6828984Z %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma> 2026-02-21T09:49:50.6829155Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:49:50.6829363Z %55 = tt.addptr %54, %53 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:49:50.6829550Z tt.store %55, %47 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:49:50.6829682Z tt.return 2026-02-21T09:49:50.6829764Z } 2026-02-21T09:49:50.6829835Z } 2026-02-21T09:49:50.6829877Z 2026-02-21T09:49:50.6829910Z {-# 2026-02-21T09:49:50.6829988Z external_resources: { 2026-02-21T09:49:50.6830087Z mlir_reproducer: { 2026-02-21T09:49:50.6831088Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, 
convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:49:50.6832094Z disable_threading: false, 2026-02-21T09:49:50.6832211Z verify_each: true 2026-02-21T09:49:50.6832315Z } 2026-02-21T09:49:50.6832387Z } 2026-02-21T09:49:50.6832453Z #-} 2026-02-21T09:49:50.6832730Z /tmp/torchinductor_root/e2/ce2keubxat3ohvkvcbr4g2kv75vu3b4vxfk6rolgmc3rzn6ikxlu.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:49:50.6833413Z /tmp/torchinductor_root/e2/ce2keubxat3ohvkvcbr4g2kv75vu3b4vxfk6rolgmc3rzn6ikxlu.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:49:50.6833955Z [321s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:49:50.6834679Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:49:50.6835342Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:49:50.6835511Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:49:52.0443776Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 103/103 6.0 configs/s 2026-02-21T09:49:56.6689545Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 179/179 22.1 configs/s 2026-02-21T09:49:59.9110415Z [330s] Generation 2 complete: 2026-02-21T09:49:59.9110655Z error=19 2026-02-21T09:49:59.9111254Z timeout=1 2026-02-21T09:49:59.9111367Z ok=88 2026-02-21T09:49:59.9111507Z min=1.1387 2026-02-21T09:49:59.9111620Z mid=1.8622 2026-02-21T09:49:59.9111729Z max=438.1238 2026-02-21T09:49:59.9111860Z best={'block_sizes': [8, 128, 256], 2026-02-21T09:49:59.9112062Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:49:59.9112257Z 'l2_groupings': [2], 2026-02-21T09:49:59.9112404Z 'load_eviction_policies': ['', ''], 2026-02-21T09:49:59.9112565Z 'loop_orders': [[0, 1]], 2026-02-21T09:49:59.9112712Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:49:59.9112853Z 'num_stages': 1, 2026-02-21T09:49:59.9112973Z 'num_warps': 4, 2026-02-21T09:49:59.9113112Z 'pid_type': 'flat', 2026-02-21T09:49:59.9113256Z 'range_flattens': [None, None], 2026-02-21T09:49:59.9113418Z 'range_multi_buffers': [None, False], 2026-02-21T09:49:59.9113592Z 'range_num_stages': [0, 3], 2026-02-21T09:49:59.9113738Z 'range_unroll_factors': [0, 0], 2026-02-21T09:49:59.9113893Z 'range_warp_specializes': [], 2026-02-21T09:49:59.9114042Z 'waves_per_eu': 2} 2026-02-21T09:49:59.9160444Z [330s] Fitting surrogate: 322 points, 322 targets 2026-02-21T09:50:00.9209802Z [331s] Generation 3 starting: 98 neighbors, 5 active search path(s) 2026-02-21T09:50:24.1191574Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 1.2 configs/s 2026-02-21T09:50:29.6463916Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:50:29.6474809Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:29.6478046Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:50:29.6479313Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:29.6480116Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:29.6480812Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:50:29.6481450Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:50:29.6481896Z #smem = #ttg.shared_memory 2026-02-21T09:50:29.6482479Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:50:29.6483809Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:50:29.6484795Z %cst = arith.constant dense<4> : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6485189Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:50:29.6485478Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:50:29.6485765Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:50:29.6486219Z %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6486510Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:50:29.6486706Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:50:29.6486910Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T09:50:29.6487124Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:50:29.6487328Z %c8_i32 = arith.constant 8 : i32 
2026-02-21T09:50:29.6487529Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:50:29.6487739Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:50:29.6488079Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:50:29.6488341Z %cst_1 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6488662Z %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6489029Z %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6489455Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:50:29.6489768Z %cst_4 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6490152Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6490469Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.6490771Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.6491073Z %cst_8 = arith.constant dense<8192> : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6491377Z %cst_9 = arith.constant dense<0> : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6491669Z %cst_10 = arith.constant dense<16384> : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6492047Z %cst_11 = arith.constant dense<0> : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6492351Z %cst_12 = arith.constant dense<8192> : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6492613Z %0 = tt.get_program_id x : i32 2026-02-21T09:50:29.6492878Z %1 = arith.muli %0, %c2_i32 : i32 2026-02-21T09:50:29.6493082Z %2 = arith.addi %1, %c2_i32 : i32 2026-02-21T09:50:29.6493286Z %3 = arith.minsi %2, %c16384_i32 : i32 2026-02-21T09:50:29.6493714Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6494196Z %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6494716Z %6 = 
tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6495206Z %7 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6495697Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6496164Z %9 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6496506Z %10 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6496776Z %11 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6497140Z %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:50:29.6497696Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:50:29.6498240Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.6498583Z %15 = arith.cmpi eq, %14, %cst_6 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.6498857Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:50:29.6499121Z %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.6499376Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:50:29.6499650Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.6500005Z %20 = arith.extsi %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6500483Z %21 = arith.extsi %7 : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6500791Z %22 = arith.subi %3, %1 : i32 2026-02-21T09:50:29.6500942Z %23 = arith.remsi %22, %c4_i32 : i32 2026-02-21T09:50:29.6501103Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:50:29.6501244Z %25 = arith.addi %1, %24 : i32 2026-02-21T09:50:29.6501429Z scf.for %arg3 = %1 to %25 step %c4_i32 : i32 { 2026-02-21T09:50:29.6501614Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.6501780Z %27 = arith.muli %26, %c8_i32 : i32 2026-02-21T09:50:29.6501940Z %28 = arith.subi %c256_i32, %27 : i32 2026-02-21T09:50:29.6502091Z %29 = arith.minsi %28, %c8_i32 : i32 2026-02-21T09:50:29.6502254Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.6502412Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:50:29.6502564Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:50:29.6502709Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:50:29.6502867Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:50:29.6503094Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6503391Z %36 = arith.addi %35, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6503633Z %37 = arith.muli %33, %c128_i32 : i32 2026-02-21T09:50:29.6503860Z %38 = tt.splat %37 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6504151Z %39 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6504511Z %40 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6504850Z %41 = arith.muli %40, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6505113Z %42 = tt.broadcast %41 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6505509Z %43 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, 
#ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.6505866Z %44 = tt.broadcast %43 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6506179Z %45 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6506537Z %46 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6506892Z %47 = tt.broadcast %46 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6507155Z %48 = arith.addi %42, %47 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6507363Z %49 = tt.addptr %10, %48 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6507575Z %50 = tt.load %49 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6507832Z %51 = tt.expand_dims %8 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6508092Z %52 = arith.muli %51, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6508288Z %53 = tt.broadcast %52 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6508488Z %54 = arith.addi %53, %44 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6508690Z %55 = tt.addptr %11, %54 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6508901Z %56 = tt.load %55 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6509191Z %57 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6509568Z ttg.local_store %50, %57 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6509859Z %58 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6510128Z %59 = arith.addi %9, %cst_3 : tensor<4xi32, 
#ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6510419Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6510732Z %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6510927Z %62 = arith.addi %42, %61 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6511136Z %63 = tt.addptr %10, %62 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6511349Z %64 = tt.load %63 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6511605Z %65 = tt.expand_dims %58 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6511860Z %66 = arith.muli %65, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6512060Z %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6512263Z %68 = arith.addi %67, %44 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6512464Z %69 = tt.addptr %11, %68 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6512672Z %70 = tt.load %69 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6512959Z %71 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6513329Z ttg.local_store %64, %71 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6514017Z %72:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %57, %arg8 = %71, %arg9 = %56, %arg10 = %70) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.6514569Z %409 = 
arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.6514776Z %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6515016Z %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6515201Z %412 = arith.muli %409, %c2_i32 : i32 2026-02-21T09:50:29.6515383Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6515618Z %414 = arith.addi %413, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6515915Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6516213Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6516412Z %417 = arith.addi %42, %416 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6516621Z %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6516832Z %419 = tt.load %418 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6517143Z %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6517596Z %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6517986Z %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6518241Z %423 = arith.muli %422, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6518461Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6518663Z %425 = arith.addi %424, %44 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6518873Z %426 = 
tt.addptr %11, %425 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6519121Z %427 = tt.load %426 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6519291Z %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6519455Z %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6519711Z %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6519972Z %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6520224Z %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6520573Z %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6520925Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6521232Z %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6521493Z %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6521746Z %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6521999Z %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6522254Z %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6522494Z %440 = arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6522868Z %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6523378Z %442 = 
tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6523742Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.6523876Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:50:29.6524014Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:50:29.6524287Z %446 = ttg.memdesc_index %45[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6524647Z ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6525139Z scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6525536Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.6525812Z %73 = ttg.local_load %72#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6526244Z %74 = arith.extf %73 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6526545Z %75 = arith.shli %72#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6526738Z %76 = arith.shrsi %75, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6526985Z %77 = ttg.convert_layout %76 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6527232Z %78 = arith.shrsi %72#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6527502Z %79 = ttg.convert_layout %78 : tensor<2x128xi8, #blocked> -> 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6527839Z %80 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6528187Z %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6528483Z %82 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6528730Z %83 = arith.select %16, %82, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6528979Z %84 = tt.broadcast %81 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6529222Z %85 = arith.select %18, %84, %83 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6529455Z %86 = tt.reshape %85 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6529686Z %87 = arith.sitofp %86 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6529985Z %88 = ttg.convert_layout %87 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6530455Z %89 = tt.dot %74, %88, %72#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6530968Z %90 = ttg.local_load %72#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6531391Z %91 = arith.extf %90 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6531709Z %92 = arith.shli %72#5, %cst : tensor<2x128xi8, #blocked> 
2026-02-21T09:50:29.6531871Z %93 = arith.shrsi %92, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6532116Z %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6532365Z %95 = arith.shrsi %72#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6532606Z %96 = ttg.convert_layout %95 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6532945Z %97 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6533291Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6533585Z %99 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6533838Z %100 = arith.select %16, %99, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6534082Z %101 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6534327Z %102 = arith.select %18, %101, %100 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6534570Z %103 = tt.reshape %102 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6534801Z %104 = arith.sitofp %103 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6535124Z %105 = ttg.convert_layout %104 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6535587Z %106 = tt.dot %91, %105, %89, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6535991Z ttg.local_dealloc %45 
: !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6536222Z %107 = arith.truncf %106 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.6536398Z %108 = arith.extsi %34 : i32 to i64 2026-02-21T09:50:29.6536522Z %109 = arith.extsi %37 : i32 to i64 2026-02-21T09:50:29.6536688Z %110 = tt.splat %108 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6536906Z %111 = arith.addi %110, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6537180Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6537423Z %113 = arith.muli %112, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6537608Z %114 = tt.broadcast %113 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6537822Z %115 = tt.splat %109 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6538041Z %116 = arith.addi %115, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6538312Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6538582Z %118 = tt.broadcast %117 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6538772Z %119 = arith.addi %114, %118 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6538991Z %120 = tt.addptr %19, %119 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6539199Z %121 = arith.cmpi sge, %112, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6539387Z %122 = arith.cmpi slt, %112, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6539553Z %123 = arith.andi %121, %122 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.6539733Z %124 = tt.broadcast %123 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6539922Z %125 = arith.cmpi sge, %117, %cst_11 : 
tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6540097Z %126 = arith.cmpi slt, %117, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6540259Z %127 = arith.andi %125, %126 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.6540441Z %128 = tt.broadcast %127 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6540622Z %129 = arith.andi %124, %128 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6549481Z tt.store %120, %107, %129 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.6549649Z %130 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:50:29.6549786Z %131 = arith.divsi %130, %c512_i32 : i32 2026-02-21T09:50:29.6549925Z %132 = arith.muli %131, %c8_i32 : i32 2026-02-21T09:50:29.6550056Z %133 = arith.subi %c256_i32, %132 : i32 2026-02-21T09:50:29.6550190Z %134 = arith.minsi %133, %c8_i32 : i32 2026-02-21T09:50:29.6550318Z %135 = arith.remsi %130, %c512_i32 : i32 2026-02-21T09:50:29.6550438Z %136 = arith.remsi %135, %134 : i32 2026-02-21T09:50:29.6550564Z %137 = arith.addi %132, %136 : i32 2026-02-21T09:50:29.6550681Z %138 = arith.divsi %135, %134 : i32 2026-02-21T09:50:29.6550806Z %139 = arith.muli %137, %c64_i32 : i32 2026-02-21T09:50:29.6550982Z %140 = tt.splat %139 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6551215Z %141 = arith.addi %140, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6551398Z %142 = arith.muli %138, %c128_i32 : i32 2026-02-21T09:50:29.6551624Z %143 = tt.splat %142 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6551854Z %144 = arith.addi %143, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6552138Z %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6552418Z %146 = arith.muli %145, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6552620Z %147 = tt.broadcast %146 : 
tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6552908Z %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.6553194Z %149 = tt.broadcast %148 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6553421Z %150 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6553617Z %151 = arith.addi %147, %47 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6553879Z %152 = tt.addptr %10, %151 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6554088Z %153 = tt.load %152 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6554255Z %154 = arith.addi %53, %149 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6554455Z %155 = tt.addptr %11, %154 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6554661Z %156 = tt.load %155 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6554944Z %157 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6555326Z ttg.local_store %153, %157 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6555575Z %158 = arith.addi %147, %61 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6555775Z %159 = tt.addptr %10, %158 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6556005Z %160 = tt.load %159 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6556166Z %161 = arith.addi %67, %149 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6556365Z %162 = tt.addptr %11, %161 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6556570Z %163 = tt.load %162 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6556847Z %164 = ttg.memdesc_index 
%150[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6557211Z ttg.local_store %160, %164 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6557849Z %165:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %157, %arg8 = %164, %arg9 = %156, %arg10 = %163) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.6558380Z %409 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.6558561Z %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6558787Z %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6558965Z %412 = arith.muli %409, %c2_i32 : i32 2026-02-21T09:50:29.6559141Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6559363Z %414 = arith.addi %413, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6559665Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6559947Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6560150Z %417 = arith.addi %147, %416 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6560377Z %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6560587Z %419 = tt.load %418 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6560894Z %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6561334Z %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6561726Z %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6561981Z %423 = arith.muli %422, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6562177Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6562379Z %425 = arith.addi %424, %149 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6562640Z %426 = tt.addptr %11, %425 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6562852Z %427 = tt.load %426 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6563018Z %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6563179Z %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6563453Z %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6563711Z %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6563964Z %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6564333Z %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6564683Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6564986Z %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6565239Z %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, 
#blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6565494Z %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6565750Z %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6565990Z %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6566228Z %440 = arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6566534Z %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6567017Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6567378Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.6567510Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:50:29.6567655Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:50:29.6567946Z %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6568305Z ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6568810Z scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6569196Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.6569479Z %166 = ttg.local_load %165#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6569913Z %167 = arith.extf %166 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6570214Z %168 = arith.shli %165#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6570387Z %169 = arith.shrsi %168, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6570635Z %170 = ttg.convert_layout %169 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6570891Z %171 = arith.shrsi %165#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6571145Z %172 = ttg.convert_layout %171 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6571491Z %173 = tt.expand_dims %170 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6571864Z %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6572159Z %175 = tt.broadcast %173 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6572425Z %176 = arith.select %16, %175, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6572675Z %177 = tt.broadcast %174 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6572915Z %178 = arith.select %18, %177, %176 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6573156Z %179 = tt.reshape %178 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6573387Z %180 = arith.sitofp %179 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6573688Z %181 = ttg.convert_layout %180 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, 
parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6574163Z %182 = tt.dot %167, %181, %165#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6574656Z %183 = ttg.local_load %165#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6575087Z %184 = arith.extf %183 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6575390Z %185 = arith.shli %165#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6575553Z %186 = arith.shrsi %185, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6575803Z %187 = ttg.convert_layout %186 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6576059Z %188 = arith.shrsi %165#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6576330Z %189 = ttg.convert_layout %188 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6576674Z %190 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6577032Z %191 = tt.expand_dims %189 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6577327Z %192 = tt.broadcast %190 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6577580Z %193 = arith.select %16, %192, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6577825Z %194 = tt.broadcast %191 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6578070Z %195 = arith.select 
%18, %194, %193 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6578310Z %196 = tt.reshape %195 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6578544Z %197 = arith.sitofp %196 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6578848Z %198 = ttg.convert_layout %197 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6579313Z %199 = tt.dot %184, %198, %182, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6579706Z ttg.local_dealloc %150 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6579922Z %200 = arith.truncf %199 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.6580100Z %201 = arith.extsi %139 : i32 to i64 2026-02-21T09:50:29.6580264Z %202 = arith.extsi %142 : i32 to i64 2026-02-21T09:50:29.6580428Z %203 = tt.splat %201 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6580664Z %204 = arith.addi %203, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6580930Z %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6581175Z %206 = arith.muli %205, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6581360Z %207 = tt.broadcast %206 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6581569Z %208 = tt.splat %202 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6581786Z %209 = arith.addi %208, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6582056Z %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = 
#mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6582322Z %211 = tt.broadcast %210 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6582513Z %212 = arith.addi %207, %211 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6582707Z %213 = tt.addptr %19, %212 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6582917Z %214 = arith.cmpi sge, %205, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6583085Z %215 = arith.cmpi slt, %205, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6583249Z %216 = arith.andi %214, %215 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.6583422Z %217 = tt.broadcast %216 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6583613Z %218 = arith.cmpi sge, %210, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6583786Z %219 = arith.cmpi slt, %210, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6583945Z %220 = arith.andi %218, %219 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.6584142Z %221 = tt.broadcast %220 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6584323Z %222 = arith.andi %217, %221 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6584489Z tt.store %213, %200, %222 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.6584653Z %223 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:50:29.6584781Z %224 = arith.divsi %223, %c512_i32 : i32 2026-02-21T09:50:29.6584909Z %225 = arith.muli %224, %c8_i32 : i32 2026-02-21T09:50:29.6585030Z %226 = arith.subi %c256_i32, %225 : i32 2026-02-21T09:50:29.6585156Z %227 = arith.minsi %226, %c8_i32 : i32 2026-02-21T09:50:29.6585278Z %228 = arith.remsi %223, %c512_i32 : i32 2026-02-21T09:50:29.6585402Z %229 = arith.remsi %228, %227 : i32 2026-02-21T09:50:29.6585520Z %230 = arith.addi %225, %229 : i32 2026-02-21T09:50:29.6585642Z %231 = arith.divsi %228, %227 : i32 2026-02-21T09:50:29.6585760Z %232 = arith.muli %230, %c64_i32 : i32 2026-02-21T09:50:29.6585939Z %233 = tt.splat %232 : i32 -> tensor<64xi32, 
#ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6586169Z %234 = arith.addi %233, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6586344Z %235 = arith.muli %231, %c128_i32 : i32 2026-02-21T09:50:29.6586521Z %236 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6586742Z %237 = arith.addi %236, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6587025Z %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6587281Z %239 = arith.muli %238, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6587477Z %240 = tt.broadcast %239 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6587789Z %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.6588070Z %242 = tt.broadcast %241 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6588314Z %243 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6588509Z %244 = arith.addi %240, %47 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6588710Z %245 = tt.addptr %10, %244 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6588922Z %246 = tt.load %245 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6589081Z %247 = arith.addi %53, %242 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6589282Z %248 = tt.addptr %11, %247 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6589487Z %249 = tt.load %248 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6589768Z %250 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 
2026-02-21T09:50:29.6590132Z ttg.local_store %246, %250 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6590371Z %251 = arith.addi %240, %61 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6590574Z %252 = tt.addptr %10, %251 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6590783Z %253 = tt.load %252 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6590940Z %254 = arith.addi %67, %242 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6591136Z %255 = tt.addptr %11, %254 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6591335Z %256 = tt.load %255 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6591646Z %257 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6592000Z ttg.local_store %253, %257 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6592627Z %258:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %250, %arg8 = %257, %arg9 = %249, %arg10 = %256) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.6593168Z %409 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.6593344Z %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6593571Z %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6593751Z %412 = arith.muli %409, %c2_i32 : i32 2026-02-21T09:50:29.6593924Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6594142Z %414 = arith.addi %413, %9 : 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6594418Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6594692Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6594889Z %417 = arith.addi %240, %416 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6595092Z %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6595300Z %419 = tt.load %418 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6595618Z %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6596052Z %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6596448Z %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6596693Z %423 = arith.muli %422, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6596881Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6597068Z %425 = arith.addi %424, %242 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6597260Z %426 = tt.addptr %11, %425 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6597464Z %427 = tt.load %426 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6597621Z %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6597775Z %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6598016Z %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:50:29.6598262Z %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6598499Z %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6598836Z %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6599177Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6599472Z %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6599740Z %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6599984Z %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6600237Z %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6600471Z %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6600693Z %440 = arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6600993Z %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6601459Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6601803Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.6601930Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:50:29.6602061Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:50:29.6602324Z %446 = ttg.memdesc_index 
%243[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6602947Z ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6603448Z scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6603932Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.6604212Z %259 = ttg.local_load %258#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6604667Z %260 = arith.extf %259 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6604967Z %261 = arith.shli %258#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6605130Z %262 = arith.shrsi %261, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6605376Z %263 = ttg.convert_layout %262 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6605619Z %264 = arith.shrsi %258#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6605858Z %265 = ttg.convert_layout %264 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6606198Z %266 = tt.expand_dims %263 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6606542Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6606831Z %268 = tt.broadcast %266 : tensor<2x1x128xi8, 
#blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6607072Z %269 = arith.select %16, %268, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6607313Z %270 = tt.broadcast %267 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6607551Z %271 = arith.select %18, %270, %269 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6607783Z %272 = tt.reshape %271 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6608034Z %273 = arith.sitofp %272 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6608327Z %274 = ttg.convert_layout %273 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6608790Z %275 = tt.dot %260, %274, %258#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6609301Z %276 = ttg.local_load %258#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6609722Z %277 = arith.extf %276 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6610014Z %278 = arith.shli %258#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6610174Z %279 = arith.shrsi %278, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6610413Z %280 = ttg.convert_layout %279 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6610666Z %281 = arith.shrsi %258#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6610907Z %282 = ttg.convert_layout %281 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6611241Z %283 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6611579Z %284 = tt.expand_dims %282 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6611864Z %285 = tt.broadcast %283 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6612131Z %286 = arith.select %16, %285, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6612370Z %287 = tt.broadcast %284 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6612621Z %288 = arith.select %18, %287, %286 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6612859Z %289 = tt.reshape %288 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6613083Z %290 = arith.sitofp %289 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6613378Z %291 = ttg.convert_layout %290 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6613833Z %292 = tt.dot %277, %291, %275, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6614216Z ttg.local_dealloc %243 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6614426Z %293 = arith.truncf %292 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.6614595Z %294 = arith.extsi %232 : i32 to i64 2026-02-21T09:50:29.6614709Z %295 = arith.extsi %235 : i32 to i64 2026-02-21T09:50:29.6614868Z %296 = tt.splat %294 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:50:29.6615074Z %297 = arith.addi %296, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6615340Z %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6615574Z %299 = arith.muli %298, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6615748Z %300 = tt.broadcast %299 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6615975Z %301 = tt.splat %295 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6616184Z %302 = arith.addi %301, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6616446Z %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6616717Z %304 = tt.broadcast %303 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6616900Z %305 = arith.addi %300, %304 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6617088Z %306 = tt.addptr %19, %305 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6617283Z %307 = arith.cmpi sge, %298, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6617446Z %308 = arith.cmpi slt, %298, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6617598Z %309 = arith.andi %307, %308 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.6617773Z %310 = tt.broadcast %309 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6617955Z %311 = arith.cmpi sge, %303, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6618120Z %312 = arith.cmpi slt, %303, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6618277Z %313 = arith.andi %311, %312 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.6618445Z %314 = tt.broadcast %313 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6618618Z %315 = arith.andi %310, %314 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6618774Z 
tt.store %306, %293, %315 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.6618918Z %316 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:50:29.6619042Z %317 = arith.divsi %316, %c512_i32 : i32 2026-02-21T09:50:29.6619163Z %318 = arith.muli %317, %c8_i32 : i32 2026-02-21T09:50:29.6619280Z %319 = arith.subi %c256_i32, %318 : i32 2026-02-21T09:50:29.6619394Z %320 = arith.minsi %319, %c8_i32 : i32 2026-02-21T09:50:29.6619525Z %321 = arith.remsi %316, %c512_i32 : i32 2026-02-21T09:50:29.6619638Z %322 = arith.remsi %321, %320 : i32 2026-02-21T09:50:29.6619749Z %323 = arith.addi %318, %322 : i32 2026-02-21T09:50:29.6619880Z %324 = arith.divsi %321, %320 : i32 2026-02-21T09:50:29.6619995Z %325 = arith.muli %323, %c64_i32 : i32 2026-02-21T09:50:29.6620164Z %326 = tt.splat %325 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6620383Z %327 = arith.addi %326, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6620553Z %328 = arith.muli %324, %c128_i32 : i32 2026-02-21T09:50:29.6620714Z %329 = tt.splat %328 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6620930Z %330 = arith.addi %329, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6621203Z %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6621456Z %332 = arith.muli %331, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6621647Z %333 = tt.broadcast %332 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6621921Z %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.6622194Z %335 = tt.broadcast %334 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6622409Z %336 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, 
#shared, #smem, mutable> 2026-02-21T09:50:29.6622594Z %337 = arith.addi %333, %47 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6622796Z %338 = tt.addptr %10, %337 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6623000Z %339 = tt.load %338 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6623176Z %340 = arith.addi %53, %335 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6623366Z %341 = tt.addptr %11, %340 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6623562Z %342 = tt.load %341 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6623836Z %343 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6624204Z ttg.local_store %339, %343 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6624443Z %344 = arith.addi %333, %61 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6624635Z %345 = tt.addptr %10, %344 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6624834Z %346 = tt.load %345 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6624987Z %347 = arith.addi %67, %335 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6625175Z %348 = tt.addptr %11, %347 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6625372Z %349 = tt.load %348 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6625649Z %350 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6626000Z ttg.local_store %346, %350 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6626612Z %351:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %343, %arg8 = %350, 
%arg9 = %342, %arg10 = %349) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.6627142Z %409 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.6627313Z %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6627549Z %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6627718Z %412 = arith.muli %409, %c2_i32 : i32 2026-02-21T09:50:29.6627890Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6628109Z %414 = arith.addi %413, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6628379Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6628651Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6628842Z %417 = arith.addi %333, %416 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6629043Z %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6629244Z %419 = tt.load %418 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6629540Z %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6629968Z %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6630339Z %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6630581Z %423 = arith.muli 
%422, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6630766Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6630979Z %425 = arith.addi %424, %335 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6631175Z %426 = tt.addptr %11, %425 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6631369Z %427 = tt.load %426 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6631523Z %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6631690Z %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6631931Z %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6632176Z %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6632417Z %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6632752Z %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6633092Z %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6633381Z %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6633629Z %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6633868Z %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6634105Z %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6634337Z %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6634561Z %440 = 
arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6634876Z %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6635354Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6635701Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.6635824Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:50:29.6635953Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:50:29.6636214Z %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6636564Z ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6637045Z scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6637428Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.6637700Z %352 = ttg.local_load %351#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6638126Z %353 = arith.extf %352 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6638418Z %354 = arith.shli %351#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6638578Z %355 = arith.shrsi %354, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6638823Z %356 = 
ttg.convert_layout %355 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6639084Z %357 = arith.shrsi %351#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6639324Z %358 = ttg.convert_layout %357 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6639678Z %359 = tt.expand_dims %356 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6640022Z %360 = tt.expand_dims %358 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6640311Z %361 = tt.broadcast %359 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6640553Z %362 = arith.select %16, %361, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6640794Z %363 = tt.broadcast %360 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6641029Z %364 = arith.select %18, %363, %362 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6641260Z %365 = tt.reshape %364 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6641483Z %366 = arith.sitofp %365 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6641776Z %367 = ttg.convert_layout %366 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6642239Z %368 = tt.dot %353, %367, %351#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6642808Z %369 = ttg.local_load %351#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6643256Z %370 = arith.extf %369 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6643569Z %371 = arith.shli %351#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6643727Z %372 = arith.shrsi %371, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6643976Z %373 = ttg.convert_layout %372 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6644230Z %374 = arith.shrsi %351#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6644476Z %375 = ttg.convert_layout %374 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6644822Z %376 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6645171Z %377 = tt.expand_dims %375 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6645476Z %378 = tt.broadcast %376 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6645731Z %379 = arith.select %16, %378, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6645979Z %380 = tt.broadcast %377 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6646223Z %381 = arith.select %18, %380, %379 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6646463Z %382 = tt.reshape %381 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6646696Z %383 = arith.sitofp %382 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6647001Z %384 = ttg.convert_layout %383 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6647492Z %385 = tt.dot %370, %384, %368, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6647898Z ttg.local_dealloc %336 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6648117Z %386 = arith.truncf %385 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.6648290Z %387 = arith.extsi %325 : i32 to i64 2026-02-21T09:50:29.6648416Z %388 = arith.extsi %328 : i32 to i64 2026-02-21T09:50:29.6648579Z %389 = tt.splat %387 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6648796Z %390 = arith.addi %389, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6649060Z %391 = tt.expand_dims %390 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6649303Z %392 = arith.muli %391, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6649488Z %393 = tt.broadcast %392 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6649698Z %394 = tt.splat %388 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6649916Z %395 = arith.addi %394, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6650185Z %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6650450Z %397 = tt.broadcast %396 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6650638Z %398 = arith.addi %393, %397 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6650827Z %399 = tt.addptr %19, %398 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6651053Z %400 = arith.cmpi sge, %391, %cst_9 : tensor<64x1xi64, 
#mma> 2026-02-21T09:50:29.6651224Z %401 = arith.cmpi slt, %391, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6651403Z %402 = arith.andi %400, %401 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.6651580Z %403 = tt.broadcast %402 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6651770Z %404 = arith.cmpi sge, %396, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6651946Z %405 = arith.cmpi slt, %396, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6652106Z %406 = arith.andi %404, %405 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.6652285Z %407 = tt.broadcast %406 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6652466Z %408 = arith.andi %403, %407 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6652633Z tt.store %399, %386, %408 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.6652784Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:29.6652908Z scf.for %arg3 = %25 to %3 step %c1_i32 : i32 { 2026-02-21T09:50:29.6653051Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.6653174Z %27 = arith.muli %26, %c8_i32 : i32 2026-02-21T09:50:29.6653296Z %28 = arith.subi %c256_i32, %27 : i32 2026-02-21T09:50:29.6653413Z %29 = arith.minsi %28, %c8_i32 : i32 2026-02-21T09:50:29.6653534Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.6653651Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:50:29.6653765Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:50:29.6653876Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:50:29.6653987Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:50:29.6654153Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6654370Z %36 = arith.addi %35, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.6654542Z %37 = arith.muli %33, %c128_i32 : i32 2026-02-21T09:50:29.6654723Z %38 = tt.splat %37 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6654938Z %39 = 
arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.6655212Z %40 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6655472Z %41 = arith.muli %40, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.6655660Z %42 = tt.broadcast %41 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6655930Z %43 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.6656201Z %44 = tt.broadcast %43 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6656413Z %45 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6656678Z %46 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6656945Z %47 = tt.broadcast %46 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6657130Z %48 = arith.addi %42, %47 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6657325Z %49 = tt.addptr %10, %48 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6657524Z %50 = tt.load %49 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6657754Z %51 = tt.expand_dims %8 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6657994Z %52 = arith.muli %51, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6658174Z %53 = tt.broadcast %52 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6658384Z %54 = arith.addi %53, %44 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6658574Z %55 = tt.addptr %11, %54 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6658785Z %56 = tt.load %55 : 
tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6659058Z %57 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6659406Z ttg.local_store %50, %57 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6659671Z %58 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6659891Z %59 = arith.addi %9, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6660161Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6660428Z %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6660611Z %62 = arith.addi %42, %61 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6660804Z %63 = tt.addptr %10, %62 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6661002Z %64 = tt.load %63 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6661234Z %65 = tt.expand_dims %58 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6661471Z %66 = arith.muli %65, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6661648Z %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6661830Z %68 = arith.addi %67, %44 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6662014Z %69 = tt.addptr %11, %68 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6662205Z %70 = tt.load %69 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6662493Z %71 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6662839Z 
ttg.local_store %64, %71 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6663465Z %72:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %57, %arg8 = %71, %arg9 = %56, %arg10 = %70) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.6663979Z %130 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.6664149Z %131 = tt.splat %130 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6664372Z %132 = arith.addi %131, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.6664544Z %133 = arith.muli %130, %c2_i32 : i32 2026-02-21T09:50:29.6664714Z %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6664935Z %135 = arith.addi %134, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.6665209Z %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.6665483Z %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6665676Z %138 = arith.addi %42, %137 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6665877Z %139 = tt.addptr %10, %138 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.6666082Z %140 = tt.load %139 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.6666394Z %141 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6666839Z %142 = arith.extf %141 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6667217Z %143 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6667464Z %144 = arith.muli %143, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.6667655Z %145 = tt.broadcast %144 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6667844Z %146 = arith.addi %145, %44 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6668038Z %147 = tt.addptr %11, %146 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.6668238Z %148 = tt.load %147 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.6668398Z %149 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6668558Z %150 = arith.shrsi %149, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6668801Z %151 = ttg.convert_layout %150 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6669050Z %152 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6669291Z %153 = ttg.convert_layout %152 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6669627Z %154 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6669973Z %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6670276Z %156 = tt.broadcast %154 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6670526Z %157 = arith.select %16, %156, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6670769Z %158 = tt.broadcast %155 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 
2026-02-21T09:50:29.6671026Z %159 = arith.select %18, %158, %157 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6671262Z %160 = tt.reshape %159 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6671486Z %161 = arith.sitofp %160 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6671786Z %162 = ttg.convert_layout %161 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6672256Z %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6672601Z %164 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.6672730Z %165 = arith.cmpi slt, %164, %c2_i32 : i32 2026-02-21T09:50:29.6672862Z %166 = arith.select %165, %164, %c0_i32 : i32 2026-02-21T09:50:29.6673125Z %167 = ttg.memdesc_index %45[%166] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6673480Z ttg.local_store %140, %167 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.6673974Z scf.yield %163, %166, %arg8, %167, %arg10, %148 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6674357Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.6674625Z %73 = ttg.local_load %72#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6675058Z %74 = arith.extf %73 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6675353Z %75 = arith.shli %72#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6675510Z %76 = arith.shrsi %75, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6675751Z %77 = ttg.convert_layout %76 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6675990Z %78 = arith.shrsi %72#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6676230Z %79 = ttg.convert_layout %78 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6676566Z %80 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6676902Z %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6677186Z %82 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6677426Z %83 = arith.select %16, %82, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6677661Z %84 = tt.broadcast %81 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6677892Z %85 = arith.select %18, %84, %83 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6678117Z %86 = tt.reshape %85 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.6678353Z %87 = arith.sitofp %86 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6678644Z %88 = ttg.convert_layout %87 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6679113Z %89 = tt.dot %74, %88, %72#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * 
tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6679593Z %90 = ttg.local_load %72#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6680012Z %91 = arith.extf %90 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6680303Z %92 = arith.shli %72#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6680462Z %93 = arith.shrsi %92, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6680701Z %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6680944Z %95 = arith.shrsi %72#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.6681179Z %96 = ttg.convert_layout %95 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.6681509Z %97 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6681848Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.6682130Z %99 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6682392Z %100 = arith.select %16, %99, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6682685Z %101 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6682945Z %102 = arith.select %18, %101, %100 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.6683183Z %103 = tt.reshape %102 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 
2026-02-21T09:50:29.6683409Z %104 = arith.sitofp %103 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.6683706Z %105 = ttg.convert_layout %104 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.6684162Z %106 = tt.dot %91, %105, %89, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.6684538Z ttg.local_dealloc %45 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.6684750Z %107 = arith.truncf %106 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.6684919Z %108 = arith.extsi %34 : i32 to i64 2026-02-21T09:50:29.6685037Z %109 = arith.extsi %37 : i32 to i64 2026-02-21T09:50:29.6685198Z %110 = tt.splat %108 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6685407Z %111 = arith.addi %110, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.6685668Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6685901Z %113 = arith.muli %112, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6686079Z %114 = tt.broadcast %113 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6686308Z %115 = tt.splat %109 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6686517Z %116 = arith.addi %115, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.6686787Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6687066Z %118 = tt.broadcast %117 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6687248Z %119 = arith.addi %114, %118 : 
tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6687437Z %120 = tt.addptr %19, %119 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.6687637Z %121 = arith.cmpi sge, %112, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6687802Z %122 = arith.cmpi slt, %112, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.6687957Z %123 = arith.andi %121, %122 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.6688128Z %124 = tt.broadcast %123 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6688309Z %125 = arith.cmpi sge, %117, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6688478Z %126 = arith.cmpi slt, %117, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.6688635Z %127 = arith.andi %125, %126 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.6688806Z %128 = tt.broadcast %127 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6688983Z %129 = arith.andi %124, %128 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.6689141Z tt.store %120, %107, %129 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.6689287Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:29.6689387Z tt.return 2026-02-21T09:50:29.6689464Z } 2026-02-21T09:50:29.6689541Z } 2026-02-21T09:50:29.6689584Z 2026-02-21T09:50:29.6689613Z {-# 2026-02-21T09:50:29.6689693Z external_resources: { 2026-02-21T09:50:29.6689791Z mlir_reproducer: { 2026-02-21T09:50:29.6690811Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, 
convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:50:29.6691833Z disable_threading: false, 2026-02-21T09:50:29.6691941Z verify_each: true 2026-02-21T09:50:29.6692039Z } 2026-02-21T09:50:29.6692114Z } 2026-02-21T09:50:29.6692192Z #-} 2026-02-21T09:50:29.6692476Z /tmp/torchinductor_root/3a/c3ahk2rjgdqgfrjhf4gky4xzjvueqvkmdafctmimwymr2l4aohzg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:50:29.6693171Z /tmp/torchinductor_root/3a/c3ahk2rjgdqgfrjhf4gky4xzjvueqvkmdafctmimwymr2l4aohzg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:50:29.6693725Z [360s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:50:29.6694504Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:50:29.6695211Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:50:29.6695400Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:50:29.8729448Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:50:29.8738693Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:29.8739239Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:50:29.8739727Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:29.8740193Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:29.8740648Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:50:29.8741061Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:50:29.8741354Z #smem = #ttg.shared_memory 2026-02-21T09:50:29.8741721Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:50:29.8742466Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:50:29.8743156Z %cst = arith.constant dense<4> : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8743406Z %c9728_i32 = arith.constant 9728 : i32 2026-02-21T09:50:29.8743596Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:50:29.8743784Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:50:29.8744128Z %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8744408Z %c29184_i32 = arith.constant 29184 : i32 2026-02-21T09:50:29.8744592Z %c19456_i32 = arith.constant 19456 : i32 2026-02-21T09:50:29.8744881Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T09:50:29.8745046Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:50:29.8745207Z %c16384_i32 = arith.constant 
16384 : i32 2026-02-21T09:50:29.8745372Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:50:29.8745537Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:50:29.8745702Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:50:29.8745861Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:50:29.8746019Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:50:29.8746224Z %cst_1 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8746478Z %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8746783Z %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8747047Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:50:29.8747298Z %cst_4 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8747555Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:50:29.8747718Z %c26111_i32 = arith.constant 26111 : i32 2026-02-21T09:50:29.8747936Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8748200Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.8748446Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.8748686Z %cst_8 = arith.constant dense<8192> : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8748928Z %cst_9 = arith.constant dense<0> : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8749155Z %cst_10 = arith.constant dense<16384> : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8749400Z %cst_11 = arith.constant dense<0> : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8749767Z %cst_12 = arith.constant dense<8192> : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8749979Z %0 = tt.get_program_id x : i32 2026-02-21T09:50:29.8750264Z %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8750678Z %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, 
#ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8751060Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8751450Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8751823Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8752201Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8752579Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8752862Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8753242Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:50:29.8753838Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:50:29.8754396Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.8754684Z %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.8754907Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:50:29.8755150Z %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:50:29.8755366Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:50:29.8755616Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.8755919Z %17 = arith.extsi %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 1, 
parent = #mma}>> 2026-02-21T09:50:29.8756293Z %18 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8756550Z %19 = arith.subi %c26111_i32, %0 : i32 2026-02-21T09:50:29.8756689Z %20 = arith.divui %19, %c9728_i32 : i32 2026-02-21T09:50:29.8756822Z %21 = arith.remsi %20, %c4_i32 : i32 2026-02-21T09:50:29.8756948Z %22 = arith.subi %20, %21 : i32 2026-02-21T09:50:29.8757077Z %23 = arith.muli %22, %c9728_i32 : i32 2026-02-21T09:50:29.8757205Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:50:29.8757350Z scf.for %arg3 = %0 to %24 step %c38912_i32 : i32 { 2026-02-21T09:50:29.8757504Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.8757647Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:50:29.8757771Z %27 = arith.subi %c256_i32, %26 : i32 2026-02-21T09:50:29.8757904Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:50:29.8758032Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.8758170Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:50:29.8758294Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:50:29.8758417Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:50:29.8758542Z %33 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:50:29.8758731Z %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8758979Z %35 = arith.addi %34, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8759167Z %36 = arith.muli %32, %c128_i32 : i32 2026-02-21T09:50:29.8759384Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8759623Z %38 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8759923Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8760219Z %40 = arith.muli %39, %cst_1 : 
tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8760428Z %41 = tt.broadcast %40 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8760734Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.8761035Z %43 = tt.broadcast %42 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8761277Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8761579Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8761874Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8762090Z %47 = arith.addi %41, %46 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8762304Z %48 = tt.addptr %7, %47 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8762525Z %49 = tt.load %48 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8762843Z %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8763108Z %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8763317Z %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8763548Z %53 = arith.addi %52, %43 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8763762Z %54 = tt.addptr %8, %53 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8764004Z %55 = tt.load %54 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8764302Z %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8764658Z ttg.local_store %49, %56 : tensor<64x4xbf16, #blocked2> -> 
!ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8764929Z %57 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8765150Z %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8765424Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8765692Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8765879Z %61 = arith.addi %41, %60 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8766071Z %62 = tt.addptr %7, %61 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8766271Z %63 = tt.load %62 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8766510Z %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8766749Z %65 = arith.muli %64, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8766936Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8767118Z %67 = arith.addi %66, %43 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8767308Z %68 = tt.addptr %8, %67 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8767520Z %69 = tt.load %68 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8767788Z %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8768134Z ttg.local_store %63, %70 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8768769Z %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> 
(tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.8769289Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.8769466Z %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8769687Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8769863Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:50:29.8770033Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8770253Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8770530Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8770808Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8771006Z %416 = arith.addi %41, %415 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8771205Z %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8771429Z %418 = tt.load %417 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8771731Z %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8772202Z %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8772583Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8772831Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, 
#blocked> 2026-02-21T09:50:29.8773020Z %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8773214Z %424 = arith.addi %423, %43 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8773409Z %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8773611Z %426 = tt.load %425 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8773769Z %427 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8773934Z %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8774184Z %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8774435Z %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8774682Z %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8775025Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8775393Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8775688Z %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8775938Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8776217Z %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8776462Z %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8776699Z %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8776932Z %439 = arith.sitofp %438 : tensor<4x128xi8, 
#blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8777232Z %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8777709Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8778060Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.8778190Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:50:29.8778327Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:50:29.8778586Z %445 = ttg.memdesc_index %44[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8778941Z ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8779439Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8779824Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.8780149Z %72 = ttg.local_load %71#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8780574Z %73 = arith.extf %72 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8780877Z %74 = arith.shli %71#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8781043Z %75 = arith.shrsi %74, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8781293Z %76 = ttg.convert_layout %75 : tensor<2x128xi8, #blocked> 
-> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8781545Z %77 = arith.shrsi %71#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8781787Z %78 = ttg.convert_layout %77 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8782131Z %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8782472Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8782767Z %81 = tt.broadcast %79 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8783017Z %82 = arith.select %13, %81, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8783261Z %83 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8783503Z %84 = arith.select %15, %83, %82 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8794109Z %85 = tt.reshape %84 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8794373Z %86 = arith.sitofp %85 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8794684Z %87 = ttg.convert_layout %86 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8795170Z %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8795665Z %89 = ttg.local_load %71#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8796093Z %90 = arith.extf 
%89 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8796394Z %91 = arith.shli %71#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8796565Z %92 = arith.shrsi %91, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8796813Z %93 = ttg.convert_layout %92 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8797067Z %94 = arith.shrsi %71#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8797315Z %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8797651Z %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8797998Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8798302Z %98 = tt.broadcast %96 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8798558Z %99 = arith.select %13, %98, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8798828Z %100 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8799071Z %101 = arith.select %15, %100, %99 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8799316Z %102 = tt.reshape %101 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8799550Z %103 = arith.sitofp %102 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8799858Z %104 = ttg.convert_layout %103 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8800331Z %105 = tt.dot %90, %104, %88, 
inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8800715Z ttg.local_dealloc %44 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8800943Z %106 = arith.truncf %105 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.8801125Z %107 = arith.extsi %33 : i32 to i64 2026-02-21T09:50:29.8801248Z %108 = arith.extsi %36 : i32 to i64 2026-02-21T09:50:29.8801419Z %109 = tt.splat %107 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8801633Z %110 = arith.addi %109, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8801906Z %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8802149Z %112 = arith.muli %111, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8802337Z %113 = tt.broadcast %112 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8802637Z %114 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8802856Z %115 = arith.addi %114, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8803131Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8803416Z %117 = tt.broadcast %116 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8803608Z %118 = arith.addi %113, %117 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8803809Z %119 = tt.addptr %16, %118 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8804014Z %120 = arith.cmpi sge, %111, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8804191Z %121 = arith.cmpi slt, %111, %cst_10 : tensor<64x1xi64, #mma> 
2026-02-21T09:50:29.8804357Z %122 = arith.andi %120, %121 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.8804542Z %123 = tt.broadcast %122 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8804733Z %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8804911Z %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8805081Z %126 = arith.andi %124, %125 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.8805258Z %127 = tt.broadcast %126 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8805445Z %128 = arith.andi %123, %127 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8805607Z tt.store %119, %106, %128 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.8805771Z %129 = arith.addi %arg3, %c9728_i32 : i32 2026-02-21T09:50:29.8805903Z %130 = arith.divsi %129, %c512_i32 : i32 2026-02-21T09:50:29.8806034Z %131 = arith.muli %130, %c8_i32 : i32 2026-02-21T09:50:29.8806163Z %132 = arith.subi %c256_i32, %131 : i32 2026-02-21T09:50:29.8806313Z %133 = arith.minsi %132, %c8_i32 : i32 2026-02-21T09:50:29.8806444Z %134 = arith.remsi %129, %c512_i32 : i32 2026-02-21T09:50:29.8806586Z %135 = arith.remsi %134, %133 : i32 2026-02-21T09:50:29.8806710Z %136 = arith.addi %131, %135 : i32 2026-02-21T09:50:29.8806831Z %137 = arith.divsi %134, %133 : i32 2026-02-21T09:50:29.8806959Z %138 = arith.muli %136, %c64_i32 : i32 2026-02-21T09:50:29.8807140Z %139 = tt.splat %138 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8807371Z %140 = arith.addi %139, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8807557Z %141 = arith.muli %137, %c128_i32 : i32 2026-02-21T09:50:29.8807731Z %142 = tt.splat %141 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8807962Z %143 = arith.addi %142, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8808247Z %144 = tt.expand_dims %140 
{axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8808511Z %145 = arith.muli %144, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8808716Z %146 = tt.broadcast %145 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8809002Z %147 = tt.expand_dims %143 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.8809289Z %148 = tt.broadcast %147 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8809513Z %149 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8809711Z %150 = arith.addi %146, %46 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8809918Z %151 = tt.addptr %7, %150 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8810152Z %152 = tt.load %151 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8810320Z %153 = arith.addi %52, %148 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8810523Z %154 = tt.addptr %8, %153 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8810730Z %155 = tt.load %154 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8811036Z %156 = ttg.memdesc_index %149[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8811399Z ttg.local_store %152, %156 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8811646Z %157 = arith.addi %146, %60 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8811846Z %158 = tt.addptr %7, %157 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8812058Z %159 = tt.load %158 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8812225Z %160 = arith.addi %66, %148 : tensor<2x128xi32, #blocked> 
2026-02-21T09:50:29.8812422Z %161 = tt.addptr %8, %160 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8812627Z %162 = tt.load %161 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8812903Z %163 = ttg.memdesc_index %149[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8813262Z ttg.local_store %159, %163 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8813912Z %164:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %156, %arg8 = %163, %arg9 = %155, %arg10 = %162) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.8814437Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.8814621Z %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8814920Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8815101Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:50:29.8815284Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8815511Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8815796Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8816084Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8816285Z %416 = arith.addi %146, %415 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8816495Z %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, 
#blocked2> 2026-02-21T09:50:29.8816707Z %418 = tt.load %417 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8817018Z %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8817462Z %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8817848Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8818106Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8818303Z %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8822690Z %424 = arith.addi %423, %148 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8822907Z %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8823110Z %426 = tt.load %425 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8823305Z %427 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8823468Z %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8823722Z %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8823979Z %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8824228Z %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8824578Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8824934Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = 
#blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8825242Z %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8825503Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8825754Z %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8826005Z %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8826246Z %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8826509Z %439 = arith.sitofp %438 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8826819Z %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8827310Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8827669Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.8827803Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:50:29.8827945Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:50:29.8828217Z %445 = ttg.memdesc_index %149[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8828575Z ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8829062Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, 
tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8829457Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.8829739Z %165 = ttg.local_load %164#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8830174Z %166 = arith.extf %165 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8830476Z %167 = arith.shli %164#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8830647Z %168 = arith.shrsi %167, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8830918Z %169 = ttg.convert_layout %168 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8831169Z %170 = arith.shrsi %164#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8831418Z %171 = ttg.convert_layout %170 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8831771Z %172 = tt.expand_dims %169 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8832125Z %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8832426Z %174 = tt.broadcast %172 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8832680Z %175 = arith.select %13, %174, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8832937Z %176 = tt.broadcast %173 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8833180Z %177 = arith.select %15, %176, %175 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8833427Z %178 = tt.reshape %177 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, 
#blocked3> 2026-02-21T09:50:29.8833663Z %179 = arith.sitofp %178 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8833964Z %180 = ttg.convert_layout %179 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8834435Z %181 = tt.dot %166, %180, %164#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8834953Z %182 = ttg.local_load %164#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8835378Z %183 = arith.extf %182 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8835701Z %184 = arith.shli %164#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8835869Z %185 = arith.shrsi %184, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8836125Z %186 = ttg.convert_layout %185 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8836385Z %187 = arith.shrsi %164#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8836632Z %188 = ttg.convert_layout %187 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8836980Z %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8837331Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8837632Z %191 = tt.broadcast %189 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8837890Z %192 
= arith.select %13, %191, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8838139Z %193 = tt.broadcast %190 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8838390Z %194 = arith.select %15, %193, %192 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8838629Z %195 = tt.reshape %194 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8838868Z %196 = arith.sitofp %195 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8839205Z %197 = ttg.convert_layout %196 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8839671Z %198 = tt.dot %183, %197, %181, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8840078Z ttg.local_dealloc %149 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8840303Z %199 = arith.truncf %198 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.8840480Z %200 = arith.extsi %138 : i32 to i64 2026-02-21T09:50:29.8840610Z %201 = arith.extsi %141 : i32 to i64 2026-02-21T09:50:29.8840776Z %202 = tt.splat %200 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8840995Z %203 = arith.addi %202, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8841262Z %204 = tt.expand_dims %203 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8841508Z %205 = arith.muli %204, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8841698Z %206 = tt.broadcast %205 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8841909Z %207 = tt.splat %201 : i64 -> tensor<128xi64, 
#ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8842129Z %208 = arith.addi %207, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8842399Z %209 = tt.expand_dims %208 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8842727Z %210 = tt.broadcast %209 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8842921Z %211 = arith.addi %206, %210 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8843141Z %212 = tt.addptr %16, %211 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8843346Z %213 = arith.cmpi sge, %204, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8843543Z %214 = arith.cmpi slt, %204, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8843705Z %215 = arith.andi %213, %214 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.8843888Z %216 = tt.broadcast %215 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8844083Z %217 = arith.cmpi sge, %209, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8844254Z %218 = arith.cmpi slt, %209, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8844422Z %219 = arith.andi %217, %218 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.8844599Z %220 = tt.broadcast %219 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8844785Z %221 = arith.andi %216, %220 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8844940Z tt.store %212, %199, %221 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.8845092Z %222 = arith.addi %arg3, %c19456_i32 : i32 2026-02-21T09:50:29.8845219Z %223 = arith.divsi %222, %c512_i32 : i32 2026-02-21T09:50:29.8845340Z %224 = arith.muli %223, %c8_i32 : i32 2026-02-21T09:50:29.8845461Z %225 = arith.subi %c256_i32, %224 : i32 2026-02-21T09:50:29.8845581Z %226 = arith.minsi %225, %c8_i32 : i32 2026-02-21T09:50:29.8845702Z %227 = arith.remsi %222, %c512_i32 : i32 2026-02-21T09:50:29.8845817Z %228 = arith.remsi %227, %226 
: i32 2026-02-21T09:50:29.8845933Z %229 = arith.addi %224, %228 : i32 2026-02-21T09:50:29.8846045Z %230 = arith.divsi %227, %226 : i32 2026-02-21T09:50:29.8846170Z %231 = arith.muli %229, %c64_i32 : i32 2026-02-21T09:50:29.8846340Z %232 = tt.splat %231 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8846566Z %233 = arith.addi %232, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8846741Z %234 = arith.muli %230, %c128_i32 : i32 2026-02-21T09:50:29.8846928Z %235 = tt.splat %234 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8847149Z %236 = arith.addi %235, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8847423Z %237 = tt.expand_dims %233 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8847692Z %238 = arith.muli %237, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8847888Z %239 = tt.broadcast %238 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8848167Z %240 = tt.expand_dims %236 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.8848443Z %241 = tt.broadcast %240 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8848667Z %242 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8848854Z %243 = arith.addi %239, %46 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8849052Z %244 = tt.addptr %7, %243 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8849254Z %245 = tt.load %244 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8849412Z %246 = arith.addi %52, %241 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8849602Z %247 = tt.addptr %8, %246 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 
2026-02-21T09:50:29.8849798Z %248 = tt.load %247 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8850075Z %249 = ttg.memdesc_index %242[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8850434Z ttg.local_store %245, %249 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8850691Z %250 = arith.addi %239, %60 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8850884Z %251 = tt.addptr %7, %250 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8851103Z %252 = tt.load %251 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8851262Z %253 = arith.addi %66, %241 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8851454Z %254 = tt.addptr %8, %253 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8851651Z %255 = tt.load %254 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8851921Z %256 = ttg.memdesc_index %242[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8852274Z ttg.local_store %252, %256 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8852902Z %257:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %249, %arg8 = %256, %arg9 = %248, %arg10 = %255) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.8853422Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.8853594Z %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8853813Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:50:29.8853985Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:50:29.8854154Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8854372Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8854664Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8854941Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8855132Z %416 = arith.addi %239, %415 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8855352Z %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8855553Z %418 = tt.load %417 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8855848Z %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8856277Z %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8856654Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8856898Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8857090Z %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8857280Z %424 = arith.addi %423, %241 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8857473Z %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8857668Z %426 = tt.load %425 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8857827Z %427 = arith.shli %arg9, %cst : tensor<2x128xi8, 
#blocked> 2026-02-21T09:50:29.8857982Z %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8858240Z %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8858487Z %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8858733Z %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8859087Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8859431Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8859721Z %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8859970Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8860211Z %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8860456Z %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8860692Z %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8860919Z %439 = arith.sitofp %438 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8861218Z %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8861685Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 
2026-02-21T09:50:29.8862031Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.8862158Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:50:29.8862293Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:50:29.8862572Z %445 = ttg.memdesc_index %242[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8862923Z ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8863415Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8863801Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.8864072Z %258 = ttg.local_load %257#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8864499Z %259 = arith.extf %258 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8864792Z %260 = arith.shli %257#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8864955Z %261 = arith.shrsi %260, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8865198Z %262 = ttg.convert_layout %261 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8865441Z %263 = arith.shrsi %257#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8865683Z %264 = ttg.convert_layout %263 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8866018Z %265 = tt.expand_dims %262 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> 
tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8866377Z %266 = tt.expand_dims %264 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8866669Z %267 = tt.broadcast %265 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8866928Z %268 = arith.select %13, %267, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8867173Z %269 = tt.broadcast %266 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8867412Z %270 = arith.select %15, %269, %268 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8867648Z %271 = tt.reshape %270 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8867873Z %272 = arith.sitofp %271 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8868166Z %273 = ttg.convert_layout %272 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8868632Z %274 = tt.dot %259, %273, %257#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8869121Z %275 = ttg.local_load %257#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8869544Z %276 = arith.extf %275 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8869841Z %277 = arith.shli %257#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8870000Z %278 = arith.shrsi %277, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8870245Z %279 = ttg.convert_layout %278 : tensor<2x128xi8, 
#blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8870492Z %280 = arith.shrsi %257#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8870750Z %281 = ttg.convert_layout %280 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8871087Z %282 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8871440Z %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8871727Z %284 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8871973Z %285 = arith.select %13, %284, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8872214Z %286 = tt.broadcast %283 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8872454Z %287 = arith.select %15, %286, %285 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8872687Z %288 = tt.reshape %287 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8872911Z %289 = arith.sitofp %288 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8873207Z %290 = ttg.convert_layout %289 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8873663Z %291 = tt.dot %276, %290, %274, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8874043Z ttg.local_dealloc %242 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8874259Z %292 = arith.truncf %291 : tensor<64x128xf32, #mma> to 
tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.8874427Z %293 = arith.extsi %231 : i32 to i64 2026-02-21T09:50:29.8874559Z %294 = arith.extsi %234 : i32 to i64 2026-02-21T09:50:29.8874718Z %295 = tt.splat %293 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8874942Z %296 = arith.addi %295, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8875202Z %297 = tt.expand_dims %296 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8875440Z %298 = arith.muli %297, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8875619Z %299 = tt.broadcast %298 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8875824Z %300 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8876036Z %301 = arith.addi %300, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8876302Z %302 = tt.expand_dims %301 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8876568Z %303 = tt.broadcast %302 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8876751Z %304 = arith.addi %299, %303 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8876940Z %305 = tt.addptr %16, %304 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8877137Z %306 = arith.cmpi sge, %297, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8877301Z %307 = arith.cmpi slt, %297, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8877457Z %308 = arith.andi %306, %307 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.8877627Z %309 = tt.broadcast %308 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8877811Z %310 = arith.cmpi sge, %302, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8877978Z %311 = arith.cmpi slt, %302, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8878134Z 
%312 = arith.andi %310, %311 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.8878324Z %313 = tt.broadcast %312 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8878499Z %314 = arith.andi %309, %313 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8878659Z tt.store %305, %292, %314 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.8878820Z %315 = arith.addi %arg3, %c29184_i32 : i32 2026-02-21T09:50:29.8878944Z %316 = arith.divsi %315, %c512_i32 : i32 2026-02-21T09:50:29.8879065Z %317 = arith.muli %316, %c8_i32 : i32 2026-02-21T09:50:29.8879182Z %318 = arith.subi %c256_i32, %317 : i32 2026-02-21T09:50:29.8879301Z %319 = arith.minsi %318, %c8_i32 : i32 2026-02-21T09:50:29.8879418Z %320 = arith.remsi %315, %c512_i32 : i32 2026-02-21T09:50:29.8879536Z %321 = arith.remsi %320, %319 : i32 2026-02-21T09:50:29.8879648Z %322 = arith.addi %317, %321 : i32 2026-02-21T09:50:29.8879761Z %323 = arith.divsi %320, %319 : i32 2026-02-21T09:50:29.8879878Z %324 = arith.muli %322, %c64_i32 : i32 2026-02-21T09:50:29.8880045Z %325 = tt.splat %324 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8880268Z %326 = arith.addi %325, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8880441Z %327 = arith.muli %323, %c128_i32 : i32 2026-02-21T09:50:29.8880610Z %328 = tt.splat %327 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8880828Z %329 = arith.addi %328, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8881100Z %330 = tt.expand_dims %326 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8881353Z %331 = arith.muli %330, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8881545Z %332 = tt.broadcast %331 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8881841Z %333 = tt.expand_dims %329 {axis = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.8882115Z %334 = tt.broadcast %333 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8882350Z %335 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8882539Z %336 = arith.addi %332, %46 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8882783Z %337 = tt.addptr %7, %336 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8882992Z %338 = tt.load %337 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8883153Z %339 = arith.addi %52, %334 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8883349Z %340 = tt.addptr %8, %339 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8883547Z %341 = tt.load %340 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8883825Z %342 = ttg.memdesc_index %335[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8884183Z ttg.local_store %338, %342 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8884420Z %343 = arith.addi %332, %60 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8884618Z %344 = tt.addptr %7, %343 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8884821Z %345 = tt.load %344 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8884976Z %346 = arith.addi %66, %334 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8885168Z %347 = tt.addptr %8, %346 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8885360Z %348 = tt.load %347 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8885660Z %349 = ttg.memdesc_index %335[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 
2026-02-21T09:50:29.8886020Z ttg.local_store %345, %349 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8886642Z %350:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %342, %arg8 = %349, %arg9 = %341, %arg10 = %348) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.8887177Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.8887350Z %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8887569Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8887742Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:50:29.8887912Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8888133Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8888408Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8888688Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8888892Z %416 = arith.addi %332, %415 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8889097Z %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8889316Z %418 = tt.load %417 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8889643Z %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8890077Z %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8890484Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8890735Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8890936Z %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8891136Z %424 = arith.addi %423, %334 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8891336Z %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8891546Z %426 = tt.load %425 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8891711Z %427 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8891881Z %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8892132Z %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8892392Z %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8892647Z %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8892988Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8893343Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8893648Z %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8893918Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8894175Z %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> 
tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8894421Z %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8894682Z %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8894920Z %439 = arith.sitofp %438 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8895222Z %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8895704Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8896056Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.8896195Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:50:29.8896339Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:50:29.8896609Z %445 = ttg.memdesc_index %335[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8896971Z ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8897454Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8897864Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.8898149Z %351 = ttg.local_load %350#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8898594Z %352 = arith.extf %351 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8898902Z %353 = arith.shli %350#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8899075Z %354 = arith.shrsi %353, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8899323Z %355 = ttg.convert_layout %354 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8899579Z %356 = arith.shrsi %350#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8899826Z %357 = ttg.convert_layout %356 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8900173Z %358 = tt.expand_dims %355 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8900527Z %359 = tt.expand_dims %357 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8900822Z %360 = tt.broadcast %358 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8901078Z %361 = arith.select %13, %360, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8901329Z %362 = tt.broadcast %359 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8901578Z %363 = arith.select %15, %362, %361 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8901823Z %364 = tt.reshape %363 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8902071Z %365 = arith.sitofp %364 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8902378Z %366 = ttg.convert_layout %365 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8902854Z %367 = tt.dot %352, %366, %350#0, inputPrecision = tf32 : tensor<64x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8903365Z %368 = ttg.local_load %350#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8903798Z %369 = arith.extf %368 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8904098Z %370 = arith.shli %350#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8904271Z %371 = arith.shrsi %370, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8904523Z %372 = ttg.convert_layout %371 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8904776Z %373 = arith.shrsi %350#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8905029Z %374 = ttg.convert_layout %373 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8905369Z %375 = tt.expand_dims %372 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8905721Z %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8906021Z %377 = tt.broadcast %375 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8906286Z %378 = arith.select %13, %377, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8906539Z %379 = tt.broadcast %376 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8906805Z %380 = arith.select %15, %379, %378 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8907046Z %381 = tt.reshape %380 : 
tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8907280Z %382 = arith.sitofp %381 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8907578Z %383 = ttg.convert_layout %382 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8908048Z %384 = tt.dot %369, %383, %367, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8908439Z ttg.local_dealloc %335 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8908656Z %385 = arith.truncf %384 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.8908839Z %386 = arith.extsi %324 : i32 to i64 2026-02-21T09:50:29.8908965Z %387 = arith.extsi %327 : i32 to i64 2026-02-21T09:50:29.8909136Z %388 = tt.splat %386 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8909350Z %389 = arith.addi %388, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8909622Z %390 = tt.expand_dims %389 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8909869Z %391 = arith.muli %390, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8910051Z %392 = tt.broadcast %391 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8910269Z %393 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8910503Z %394 = arith.addi %393, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8910780Z %395 = tt.expand_dims %394 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8911063Z %396 = tt.broadcast %395 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, 
#mma> 2026-02-21T09:50:29.8911249Z %397 = arith.addi %392, %396 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8911447Z %398 = tt.addptr %16, %397 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8911648Z %399 = arith.cmpi sge, %390, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8911824Z %400 = arith.cmpi slt, %390, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8911991Z %401 = arith.andi %399, %400 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.8912166Z %402 = tt.broadcast %401 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8912366Z %403 = arith.cmpi sge, %395, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8912537Z %404 = arith.cmpi slt, %395, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8912707Z %405 = arith.andi %403, %404 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.8912885Z %406 = tt.broadcast %405 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8913071Z %407 = arith.andi %402, %406 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8913239Z tt.store %398, %385, %407 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.8913388Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:29.8913529Z scf.for %arg3 = %24 to %c16384_i32 step %c9728_i32 : i32 { 2026-02-21T09:50:29.8913679Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.8913814Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:50:29.8913937Z %27 = arith.subi %c256_i32, %26 : i32 2026-02-21T09:50:29.8914079Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:50:29.8914212Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:50:29.8914335Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:50:29.8914481Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:50:29.8914598Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:50:29.8914723Z %33 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:50:29.8914896Z %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8915126Z %35 = 
arith.addi %34, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:29.8915305Z %36 = arith.muli %32, %c128_i32 : i32 2026-02-21T09:50:29.8915472Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8915696Z %38 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:50:29.8915974Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8916231Z %40 = arith.muli %39, %cst_1 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:29.8916426Z %41 = tt.broadcast %40 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8916707Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked> 2026-02-21T09:50:29.8916986Z %43 = tt.broadcast %42 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8917203Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8917480Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8917750Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8917946Z %47 = arith.addi %41, %46 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8918169Z %48 = tt.addptr %7, %47 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8918374Z %49 = tt.load %48 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8918619Z %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8918876Z %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8919067Z %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> 
-> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8919261Z %53 = arith.addi %52, %43 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8919455Z %54 = tt.addptr %8, %53 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8919657Z %55 = tt.load %54 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8919942Z %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8920305Z ttg.local_store %49, %56 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8920583Z %57 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8920811Z %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8921091Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8921363Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8921558Z %61 = arith.addi %41, %60 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8921758Z %62 = tt.addptr %7, %61 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8921980Z %63 = tt.load %62 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8922223Z %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8922507Z %65 = arith.muli %64, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8922730Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8922920Z %67 = arith.addi %66, %43 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8923113Z %68 = tt.addptr %8, %67 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 
2026-02-21T09:50:29.8923310Z %69 = tt.load %68 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8923583Z %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8923940Z ttg.local_store %63, %70 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8924562Z %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:50:29.8925087Z %129 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:29.8925268Z %130 = tt.splat %129 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8925493Z %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:29.8925678Z %132 = arith.muli %129, %c2_i32 : i32 2026-02-21T09:50:29.8925858Z %133 = tt.splat %132 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8926106Z %134 = arith.addi %133, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:29.8926392Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:29.8926675Z %136 = tt.broadcast %135 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8926895Z %137 = arith.addi %41, %136 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8927104Z %138 = tt.addptr %7, %137 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:29.8927313Z %139 = tt.load %138 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:29.8927620Z %140 = 
ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8928058Z %141 = arith.extf %140 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8928444Z %142 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8928699Z %143 = arith.muli %142, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:50:29.8928895Z %144 = tt.broadcast %143 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8929094Z %145 = arith.addi %144, %43 : tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8929293Z %146 = tt.addptr %8, %145 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi32, #blocked> 2026-02-21T09:50:29.8929502Z %147 = tt.load %146 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:50:29.8929671Z %148 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8929835Z %149 = arith.shrsi %148, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8930111Z %150 = ttg.convert_layout %149 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8930365Z %151 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8930635Z %152 = ttg.convert_layout %151 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8930984Z %153 = tt.expand_dims %150 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8931331Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8931631Z %155 = tt.broadcast %153 : tensor<2x1x128xi8, #blocked1> -> 
tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8931886Z %156 = arith.select %13, %155, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8932140Z %157 = tt.broadcast %154 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8932391Z %158 = arith.select %15, %157, %156 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8932635Z %159 = tt.reshape %158 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8932873Z %160 = arith.sitofp %159 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8933180Z %161 = ttg.convert_layout %160 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8933650Z %162 = tt.dot %141, %161, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8934006Z %163 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:29.8934140Z %164 = arith.cmpi slt, %163, %c2_i32 : i32 2026-02-21T09:50:29.8934301Z %165 = arith.select %164, %163, %c0_i32 : i32 2026-02-21T09:50:29.8934575Z %166 = ttg.memdesc_index %44[%165] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8934945Z ttg.local_store %139, %166 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:29.8935440Z scf.yield %162, %165, %arg8, %166, %arg10, %147 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8935832Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:29.8936113Z %72 = ttg.local_load 
%71#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8936543Z %73 = arith.extf %72 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8936841Z %74 = arith.shli %71#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8937011Z %75 = arith.shrsi %74, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8937257Z %76 = ttg.convert_layout %75 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8937507Z %77 = arith.shrsi %71#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8937753Z %78 = ttg.convert_layout %77 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8938087Z %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8938447Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8938546Z %81 = tt.broadcast %79 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8938678Z %82 = arith.select %13, %81, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8938774Z %83 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8938876Z %84 = arith.select %15, %83, %82 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8938971Z %85 = tt.reshape %84 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8939065Z %86 = arith.sitofp %85 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8939227Z %87 = ttg.convert_layout %86 
: tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8939495Z %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8939686Z %89 = ttg.local_load %71#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8939877Z %90 = arith.extf %89 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8939947Z %91 = arith.shli %71#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8940009Z %92 = arith.shrsi %91, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8940153Z %93 = ttg.convert_layout %92 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8940241Z %94 = arith.shrsi %71#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:50:29.8940380Z %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:29.8940536Z %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8940703Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:50:29.8940798Z %98 = tt.broadcast %96 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8940904Z %99 = arith.select %13, %98, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8941007Z %100 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 
2026-02-21T09:50:29.8941113Z %101 = arith.select %15, %100, %99 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:50:29.8941206Z %102 = tt.reshape %101 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:50:29.8941305Z %103 = arith.sitofp %102 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:50:29.8941468Z %104 = ttg.convert_layout %103 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:29.8941726Z %105 = tt.dot %90, %104, %88, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma> 2026-02-21T09:50:29.8941816Z ttg.local_dealloc %44 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:29.8941905Z %106 = arith.truncf %105 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma> 2026-02-21T09:50:29.8941970Z %107 = arith.extsi %33 : i32 to i64 2026-02-21T09:50:29.8942023Z %108 = arith.extsi %36 : i32 to i64 2026-02-21T09:50:29.8942112Z %109 = tt.splat %107 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8942213Z %110 = arith.addi %109, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:29.8942354Z %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8942420Z %112 = arith.muli %111, %cst_8 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8942504Z %113 = tt.broadcast %112 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8942592Z %114 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8942684Z %115 = arith.addi %114, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:29.8942827Z %116 = tt.expand_dims %115 {axis = 0 : i32} : 
tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8942911Z %117 = tt.broadcast %116 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8942979Z %118 = arith.addi %113, %117 : tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8943078Z %119 = tt.addptr %16, %118 : tensor<64x128x!tt.ptr, #mma>, tensor<64x128xi64, #mma> 2026-02-21T09:50:29.8943147Z %120 = arith.cmpi sge, %111, %cst_9 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8943221Z %121 = arith.cmpi slt, %111, %cst_10 : tensor<64x1xi64, #mma> 2026-02-21T09:50:29.8943280Z %122 = arith.andi %120, %121 : tensor<64x1xi1, #mma> 2026-02-21T09:50:29.8943362Z %123 = tt.broadcast %122 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8943437Z %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8943504Z %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x128xi64, #mma> 2026-02-21T09:50:29.8943564Z %126 = arith.andi %124, %125 : tensor<1x128xi1, #mma> 2026-02-21T09:50:29.8943667Z %127 = tt.broadcast %126 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8943727Z %128 = arith.andi %123, %127 : tensor<64x128xi1, #mma> 2026-02-21T09:50:29.8943797Z tt.store %119, %106, %128 : tensor<64x128x!tt.ptr, #mma> 2026-02-21T09:50:29.8943856Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:29.8943900Z tt.return 2026-02-21T09:50:29.8943935Z } 2026-02-21T09:50:29.8943976Z } 2026-02-21T09:50:29.8943980Z 2026-02-21T09:50:29.8944021Z {-# 2026-02-21T09:50:29.8944066Z external_resources: { 2026-02-21T09:50:29.8944107Z mlir_reproducer: { 2026-02-21T09:50:29.8945051Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, 
convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:50:29.8945098Z disable_threading: false, 2026-02-21T09:50:29.8945138Z verify_each: true 2026-02-21T09:50:29.8945178Z } 2026-02-21T09:50:29.8945211Z } 2026-02-21T09:50:29.8945244Z #-} 2026-02-21T09:50:29.8945494Z /tmp/torchinductor_root/kf/ckfbghpgt3j65na45xytph3niqm4zjzp5oea4cwbjjsx3nxiqfac.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:50:29.8945946Z /tmp/torchinductor_root/kf/ckfbghpgt3j65na45xytph3niqm4zjzp5oea4cwbjjsx3nxiqfac.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:50:29.8946063Z [360s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:50:29.8946702Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:50:29.8946778Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:50:29.8946861Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:50:32.8973016Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:50:32.8979357Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:50:32.8980193Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:32.8980919Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:32.8981631Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:32.8982300Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:50:32.8982905Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:50:32.8983336Z #smem = #ttg.shared_memory 2026-02-21T09:50:32.8984400Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:50:32.8985271Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:50:32.8986216Z %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:50:32.8986547Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:32.8986865Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:32.8987198Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:50:32.8987489Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:50:32.8987833Z 
%cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.8988313Z %cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.8988775Z %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.8989170Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.8989498Z %cst_7 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.8989782Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:50:32.8989994Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:50:32.8990208Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:50:32.8990431Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:50:32.8990649Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:50:32.8990852Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:50:32.8991108Z %cst_8 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.8991373Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:50:32.8991568Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:50:32.8991976Z %cst_9 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.8992323Z %0 = tt.get_program_id x : i32 2026-02-21T09:50:32.8992529Z %1 = arith.muli %0, %c4_i32 : i32 2026-02-21T09:50:32.8992837Z %2 = arith.addi %1, %c4_i32 : i32 2026-02-21T09:50:32.8993045Z %3 = arith.minsi %2, %c32768_i32 : i32 2026-02-21T09:50:32.8993421Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.8993921Z %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.8994307Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.8994659Z %7 = tt.make_range {end = 64 : i32, start = 
0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.8995006Z %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.8995363Z %9 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.8995692Z %10 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.8995962Z %11 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.8996327Z %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:50:32.8996887Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:50:32.8997436Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:32.8997770Z %15 = arith.cmpi eq, %14, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:32.8998062Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:50:32.8998328Z %17 = arith.cmpi eq, %14, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:32.8998578Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:50:32.8998878Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:32.8999083Z %20 = arith.subi %3, %1 : i32 2026-02-21T09:50:32.8999234Z %21 = arith.remsi %20, %c3_i32 : i32 2026-02-21T09:50:32.8999387Z %22 = arith.subi %20, %21 : i32 2026-02-21T09:50:32.8999534Z %23 = arith.addi %1, %22 : i32 2026-02-21T09:50:32.8999732Z scf.for %arg3 = %1 to %23 step %c3_i32 : i32 { 2026-02-21T09:50:32.8999914Z %24 = arith.divsi %arg3, %c256_i32 : i32 
2026-02-21T09:50:32.9000078Z %25 = arith.muli %24, %c2_i32 : i32 2026-02-21T09:50:32.9000237Z %26 = arith.subi %c256_i32, %25 : i32 2026-02-21T09:50:32.9000397Z %27 = arith.minsi %26, %c2_i32 : i32 2026-02-21T09:50:32.9000564Z %28 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:50:32.9000722Z %29 = arith.remsi %28, %27 : i32 2026-02-21T09:50:32.9000872Z %30 = arith.addi %25, %29 : i32 2026-02-21T09:50:32.9001014Z %31 = arith.divsi %28, %27 : i32 2026-02-21T09:50:32.9001166Z %32 = arith.muli %30, %c64_i32 : i32 2026-02-21T09:50:32.9001393Z %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9001676Z %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9001959Z %35 = arith.addi %33, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9002240Z %36 = arith.addi %34, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9002465Z %37 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:50:32.9002861Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9003143Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9003447Z %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9003732Z %41 = arith.addi %39, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9004129Z %42 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9004393Z %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9004598Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9004890Z %45 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, 
#blocked1> 2026-02-21T09:50:32.9005187Z %46 = tt.broadcast %45 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9005414Z %47 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9005702Z %48 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9005992Z %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9006195Z %50 = arith.addi %44, %49 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9006407Z %51 = tt.addptr %10, %50 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9006625Z %52 = tt.load %51 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9006926Z %53 = ttg.memdesc_index %47[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9007332Z ttg.local_store %52, %53 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9007617Z %54 = arith.addi %9, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9007915Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9008230Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9008430Z %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9008638Z %58 = tt.addptr %10, %57 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9008851Z %59 = tt.load %58 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9009148Z %60 = ttg.memdesc_index %47[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9009525Z 
ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9010072Z %61:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %53, %arg8 = %60) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:32.9010578Z %276 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9010822Z %277 = arith.addi %276, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9011013Z %278 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:32.9011145Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:50:32.9011327Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9011583Z %281 = arith.addi %280, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9011875Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9012195Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9012407Z %284 = arith.addi %44, %283 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9012622Z %285 = tt.addptr %10, %284 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9012843Z %286 = tt.load %285 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9013166Z %287 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9013632Z %288 = arith.extf %287 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:50:32.9014027Z %289 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9014276Z %290 = arith.muli %289, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9014470Z %291 = tt.broadcast %290 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9014659Z %292 = arith.addi %291, %46 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9014856Z %293 = tt.addptr %11, %292 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9015057Z %294 = tt.load %293 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9015298Z %295 = ttg.convert_layout %294 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9015580Z %296 = arith.shli %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9015858Z %297 = arith.shrsi %296, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9016100Z %298 = arith.shrsi %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9016388Z %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9016762Z %300 = tt.expand_dims %298 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9017047Z %301 = tt.broadcast %299 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9017283Z %302 = arith.select %16, %301, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9017520Z %303 = tt.broadcast %300 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9017756Z %304 = arith.select %18, %303, %302 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T09:50:32.9017984Z %305 = tt.reshape %304 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9018208Z %306 = arith.sitofp %305 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9018507Z %307 = ttg.convert_layout %306 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9018983Z %308 = tt.dot %288, %307, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9019334Z %309 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:32.9019463Z %310 = arith.cmpi slt, %309, %c2_i32 : i32 2026-02-21T09:50:32.9019615Z %311 = arith.select %310, %309, %c0_i32 : i32 2026-02-21T09:50:32.9019878Z %312 = ttg.memdesc_index %47[%311] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9020243Z ttg.local_store %286, %312 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9020634Z scf.yield %308, %311, %arg8, %312 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9020933Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:32.9021107Z %62 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9021430Z %63 = ttg.local_load %61#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9021851Z %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9022224Z %65 = 
tt.expand_dims %62 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9022466Z %66 = arith.muli %65, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9030068Z %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9030286Z %68 = arith.addi %67, %46 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9030482Z %69 = tt.addptr %11, %68 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9030677Z %70 = tt.load %69 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9030914Z %71 = ttg.convert_layout %70 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9031242Z %72 = arith.shli %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9031467Z %73 = arith.shrsi %72, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9031698Z %74 = arith.shrsi %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9031997Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9032322Z %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9032597Z %77 = tt.broadcast %75 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9032825Z %78 = arith.select %16, %77, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9033054Z %79 = tt.broadcast %76 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9033275Z %80 = arith.select %18, %79, %78 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9033494Z %81 = tt.reshape %80 : tensor<2x2x64xi8, #blocked> -> 
tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9033708Z %82 = arith.sitofp %81 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9033995Z %83 = ttg.convert_layout %82 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9034452Z %84 = tt.dot %64, %83, %61#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9034836Z %85 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9051318Z %86 = ttg.local_load %61#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9051757Z %87 = arith.extf %86 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9052162Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9052412Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9052600Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9052788Z %91 = arith.addi %90, %46 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9052980Z %92 = tt.addptr %11, %91 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9053173Z %93 = tt.load %92 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9053411Z %94 = ttg.convert_layout %93 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9053684Z %95 = arith.shli %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9053910Z %96 = 
arith.shrsi %95, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9054138Z %97 = arith.shrsi %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9054416Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9054746Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9055021Z %100 = tt.broadcast %98 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9055255Z %101 = arith.select %16, %100, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9055511Z %102 = tt.broadcast %99 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9055742Z %103 = arith.select %18, %102, %101 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9055988Z %104 = tt.reshape %103 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9056206Z %105 = arith.sitofp %104 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9056495Z %106 = ttg.convert_layout %105 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9056949Z %107 = tt.dot %87, %106, %84, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9057330Z ttg.local_dealloc %47 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9057545Z %108 = arith.truncf %107 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:32.9057809Z %109 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> 
tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9058045Z %110 = arith.muli %109, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9058274Z %111 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:32.9058531Z %112 = tt.broadcast %110 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9058732Z %113 = tt.broadcast %111 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9058911Z %114 = arith.addi %112, %113 : tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9059094Z %115 = tt.addptr %19, %114 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9059311Z tt.store %115, %108 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:32.9059451Z %116 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:50:32.9059594Z %117 = arith.divsi %116, %c256_i32 : i32 2026-02-21T09:50:32.9059713Z %118 = arith.muli %117, %c2_i32 : i32 2026-02-21T09:50:32.9059834Z %119 = arith.subi %c256_i32, %118 : i32 2026-02-21T09:50:32.9059953Z %120 = arith.minsi %119, %c2_i32 : i32 2026-02-21T09:50:32.9060071Z %121 = arith.remsi %116, %c256_i32 : i32 2026-02-21T09:50:32.9060191Z %122 = arith.remsi %121, %120 : i32 2026-02-21T09:50:32.9060303Z %123 = arith.addi %118, %122 : i32 2026-02-21T09:50:32.9060421Z %124 = arith.divsi %121, %120 : i32 2026-02-21T09:50:32.9060536Z %125 = arith.muli %123, %c64_i32 : i32 2026-02-21T09:50:32.9060711Z %126 = tt.splat %125 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9060926Z %127 = tt.splat %125 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9061141Z %128 = arith.addi %126, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9061358Z %129 = arith.addi %127, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9061521Z %130 = arith.muli %124, %c64_i32 : i32 2026-02-21T09:50:32.9061689Z %131 = tt.splat %130 : i32 
-> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9061898Z %132 = tt.splat %130 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9062112Z %133 = arith.addi %131, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9062324Z %134 = arith.addi %132, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9062593Z %135 = tt.expand_dims %128 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9062866Z %136 = arith.muli %135, %cst_7 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9063060Z %137 = tt.broadcast %136 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9063344Z %138 = tt.expand_dims %133 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:50:32.9063642Z %139 = tt.broadcast %138 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9063863Z %140 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9064053Z %141 = arith.addi %137, %49 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9064253Z %142 = tt.addptr %10, %141 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9064461Z %143 = tt.load %142 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9064749Z %144 = ttg.memdesc_index %140[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9065105Z ttg.local_store %143, %144 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9065346Z %145 = arith.addi %137, %56 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9065544Z %146 = tt.addptr %10, %145 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 
2026-02-21T09:50:32.9065753Z %147 = tt.load %146 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9066033Z %148 = ttg.memdesc_index %140[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9066384Z ttg.local_store %147, %148 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9066934Z %149:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %144, %arg8 = %148) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:32.9067419Z %276 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9067649Z %277 = arith.addi %276, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9067829Z %278 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:32.9067953Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:50:32.9068124Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9068347Z %281 = arith.addi %280, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9068622Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9068904Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9069099Z %284 = arith.addi %137, %283 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9069305Z %285 = tt.addptr %10, %284 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9069513Z %286 = tt.load %285 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9069812Z %287 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 
2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9070246Z %288 = arith.extf %287 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9070627Z %289 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9070899Z %290 = arith.muli %289, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9071094Z %291 = tt.broadcast %290 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9071288Z %292 = arith.addi %291, %139 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9071484Z %293 = tt.addptr %11, %292 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9071701Z %294 = tt.load %293 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9071947Z %295 = ttg.convert_layout %294 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9072229Z %296 = arith.shli %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9072462Z %297 = arith.shrsi %296, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9072703Z %298 = arith.shrsi %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9072992Z %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9073331Z %300 = tt.expand_dims %298 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9073616Z %301 = tt.broadcast %299 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9073851Z %302 = arith.select %16, %301, %cst_8 : tensor<2x2x64xi1, #blocked>, 
tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9074088Z %303 = tt.broadcast %300 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9074317Z %304 = arith.select %18, %303, %302 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9074547Z %305 = tt.reshape %304 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9074794Z %306 = arith.sitofp %305 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9075087Z %307 = ttg.convert_layout %306 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9075573Z %308 = tt.dot %288, %307, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9075921Z %309 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:32.9076048Z %310 = arith.cmpi slt, %309, %c2_i32 : i32 2026-02-21T09:50:32.9076184Z %311 = arith.select %310, %309, %c0_i32 : i32 2026-02-21T09:50:32.9076447Z %312 = ttg.memdesc_index %140[%311] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9076804Z ttg.local_store %286, %312 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9077194Z scf.yield %308, %311, %arg8, %312 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9077497Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:32.9077775Z %150 = ttg.local_load %149#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9078198Z %151 = arith.extf %150 : tensor<64x4xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9078496Z %152 = arith.addi %67, %139 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9078695Z %153 = tt.addptr %11, %152 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9078911Z %154 = tt.load %153 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9079155Z %155 = ttg.convert_layout %154 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9079432Z %156 = arith.shli %155, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9079682Z %157 = arith.shrsi %156, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9079918Z %158 = arith.shrsi %155, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9080200Z %159 = tt.expand_dims %157 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9080533Z %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9080814Z %161 = tt.broadcast %159 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9081051Z %162 = arith.select %16, %161, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9081292Z %163 = tt.broadcast %160 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9081521Z %164 = arith.select %18, %163, %162 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9081749Z %165 = tt.reshape %164 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9081968Z %166 = arith.sitofp %165 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9082262Z %167 = 
ttg.convert_layout %166 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9082790Z %168 = tt.dot %151, %167, %149#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9083616Z %169 = ttg.local_load %149#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9084162Z %170 = arith.extf %169 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9084464Z %171 = arith.addi %90, %139 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9084662Z %172 = tt.addptr %11, %171 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9084864Z %173 = tt.load %172 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9085106Z %174 = ttg.convert_layout %173 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9085391Z %175 = arith.shli %174, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9085628Z %176 = arith.shrsi %175, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9085864Z %177 = arith.shrsi %174, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9086154Z %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9086485Z %179 = tt.expand_dims %177 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9086768Z %180 = tt.broadcast %178 : tensor<2x1x64xi8, #blocked> -> 
tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9087008Z %181 = arith.select %16, %180, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9087245Z %182 = tt.broadcast %179 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9087507Z %183 = arith.select %18, %182, %181 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9087732Z %184 = tt.reshape %183 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9087955Z %185 = arith.sitofp %184 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9088273Z %186 = ttg.convert_layout %185 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9088729Z %187 = tt.dot %170, %186, %168, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9089113Z ttg.local_dealloc %140 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9089327Z %188 = arith.truncf %187 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:32.9089592Z %189 = tt.expand_dims %129 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9089832Z %190 = arith.muli %189, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9090056Z %191 = tt.expand_dims %134 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:32.9090314Z %192 = tt.broadcast %190 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9090512Z %193 = tt.broadcast %191 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9090690Z %194 = arith.addi %192, %193 : tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9090876Z %195 = tt.addptr %19, %194 : 
tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9091067Z tt.store %195, %188 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:32.9091208Z %196 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:50:32.9091355Z %197 = arith.divsi %196, %c256_i32 : i32 2026-02-21T09:50:32.9091476Z %198 = arith.muli %197, %c2_i32 : i32 2026-02-21T09:50:32.9091594Z %199 = arith.subi %c256_i32, %198 : i32 2026-02-21T09:50:32.9091732Z %200 = arith.minsi %199, %c2_i32 : i32 2026-02-21T09:50:32.9091851Z %201 = arith.remsi %196, %c256_i32 : i32 2026-02-21T09:50:32.9091967Z %202 = arith.remsi %201, %200 : i32 2026-02-21T09:50:32.9092083Z %203 = arith.addi %198, %202 : i32 2026-02-21T09:50:32.9092196Z %204 = arith.divsi %201, %200 : i32 2026-02-21T09:50:32.9092316Z %205 = arith.muli %203, %c64_i32 : i32 2026-02-21T09:50:32.9092485Z %206 = tt.splat %205 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9092700Z %207 = tt.splat %205 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9092912Z %208 = arith.addi %206, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9093126Z %209 = arith.addi %207, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9093290Z %210 = arith.muli %204, %c64_i32 : i32 2026-02-21T09:50:32.9093460Z %211 = tt.splat %210 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9093671Z %212 = tt.splat %210 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9093884Z %213 = arith.addi %211, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9094097Z %214 = arith.addi %212, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9094363Z %215 = tt.expand_dims %208 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9094614Z %216 = arith.muli 
%215, %cst_7 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9094810Z %217 = tt.broadcast %216 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9095105Z %218 = tt.expand_dims %213 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:50:32.9095385Z %219 = tt.broadcast %218 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9095617Z %220 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9095803Z %221 = arith.addi %217, %49 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9096000Z %222 = tt.addptr %10, %221 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9096205Z %223 = tt.load %222 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9096485Z %224 = ttg.memdesc_index %220[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9096841Z ttg.local_store %223, %224 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9097079Z %225 = arith.addi %217, %56 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9097278Z %226 = tt.addptr %10, %225 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9097480Z %227 = tt.load %226 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9097759Z %228 = ttg.memdesc_index %220[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9098110Z ttg.local_store %227, %228 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9098642Z %229:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %224, %arg8 = %228) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, 
#shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:32.9099111Z %276 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9099358Z %277 = arith.addi %276, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9099536Z %278 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:32.9099659Z %279 = arith.muli %278, %c2_i32 : i32 2026-02-21T09:50:32.9099828Z %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9100048Z %281 = arith.addi %280, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9100321Z %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9100596Z %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9100789Z %284 = arith.addi %217, %283 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9100991Z %285 = tt.addptr %10, %284 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9101200Z %286 = tt.load %285 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9101494Z %287 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9101923Z %288 = arith.extf %287 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9102300Z %289 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9102550Z %290 = arith.muli %289, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9102745Z %291 = tt.broadcast %290 : tensor<2x1xi32, #blocked1> -> 
tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9102955Z %292 = arith.addi %291, %219 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9103152Z %293 = tt.addptr %11, %292 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9103350Z %294 = tt.load %293 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9103641Z %295 = ttg.convert_layout %294 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9103917Z %296 = arith.shli %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9104151Z %297 = arith.shrsi %296, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9104382Z %298 = arith.shrsi %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9104669Z %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9105004Z %300 = tt.expand_dims %298 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9105284Z %301 = tt.broadcast %299 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9105521Z %302 = arith.select %16, %301, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9105753Z %303 = tt.broadcast %300 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9105983Z %304 = arith.select %18, %303, %302 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9106212Z %305 = tt.reshape %304 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9106430Z %306 = arith.sitofp %305 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9106744Z %307 = ttg.convert_layout %306 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, 
parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9107206Z %308 = tt.dot %288, %307, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9107568Z %309 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:32.9107697Z %310 = arith.cmpi slt, %309, %c2_i32 : i32 2026-02-21T09:50:32.9107829Z %311 = arith.select %310, %309, %c0_i32 : i32 2026-02-21T09:50:32.9108091Z %312 = ttg.memdesc_index %220[%311] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9108442Z ttg.local_store %286, %312 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9108829Z scf.yield %308, %311, %arg8, %312 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9109128Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:32.9109400Z %230 = ttg.local_load %229#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9109828Z %231 = arith.extf %230 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9110123Z %232 = arith.addi %67, %219 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9110320Z %233 = tt.addptr %11, %232 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9110521Z %234 = tt.load %233 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9110775Z %235 = ttg.convert_layout %234 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9111053Z %236 = arith.shli %235, %cst_9 : 
tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9111288Z %237 = arith.shrsi %236, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9111541Z %238 = arith.shrsi %235, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9111830Z %239 = tt.expand_dims %237 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9112157Z %240 = tt.expand_dims %238 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9112438Z %241 = tt.broadcast %239 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9112677Z %242 = arith.select %16, %241, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9112910Z %243 = tt.broadcast %240 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9113140Z %244 = arith.select %18, %243, %242 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9113363Z %245 = tt.reshape %244 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9113582Z %246 = arith.sitofp %245 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9113871Z %247 = ttg.convert_layout %246 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9114330Z %248 = tt.dot %231, %247, %229#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9114831Z %249 = ttg.local_load %229#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9115256Z %250 = arith.extf %249 : 
tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9115566Z %251 = arith.addi %90, %219 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9115763Z %252 = tt.addptr %11, %251 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9115958Z %253 = tt.load %252 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9116195Z %254 = ttg.convert_layout %253 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9116472Z %255 = arith.shli %254, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9116705Z %256 = arith.shrsi %255, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9116935Z %257 = arith.shrsi %254, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9117219Z %258 = tt.expand_dims %256 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9117548Z %259 = tt.expand_dims %257 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9117826Z %260 = tt.broadcast %258 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9118058Z %261 = arith.select %16, %260, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9118291Z %262 = tt.broadcast %259 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9118516Z %263 = arith.select %18, %262, %261 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9118770Z %264 = tt.reshape %263 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9118989Z %265 = arith.sitofp %264 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 
2026-02-21T09:50:32.9119282Z %266 = ttg.convert_layout %265 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9119753Z %267 = tt.dot %250, %266, %248, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9120132Z ttg.local_dealloc %220 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9120342Z %268 = arith.truncf %267 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:32.9120602Z %269 = tt.expand_dims %209 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9120834Z %270 = arith.muli %269, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9121060Z %271 = tt.expand_dims %214 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:32.9121314Z %272 = tt.broadcast %270 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9121513Z %273 = tt.broadcast %271 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9121689Z %274 = arith.addi %272, %273 : tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9121871Z %275 = tt.addptr %19, %274 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9122063Z tt.store %275, %268 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:32.9122200Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:32.9122321Z scf.for %arg3 = %23 to %3 step %c1_i32 : i32 { 2026-02-21T09:50:32.9122455Z %24 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:50:32.9122686Z %25 = arith.muli %24, %c2_i32 : i32 2026-02-21T09:50:32.9122810Z %26 = arith.subi %c256_i32, %25 : i32 2026-02-21T09:50:32.9122925Z %27 = arith.minsi %26, %c2_i32 : i32 2026-02-21T09:50:32.9123070Z %28 = arith.remsi %arg3, %c256_i32 : i32 
2026-02-21T09:50:32.9123186Z %29 = arith.remsi %28, %27 : i32 2026-02-21T09:50:32.9123298Z %30 = arith.addi %25, %29 : i32 2026-02-21T09:50:32.9123406Z %31 = arith.divsi %28, %27 : i32 2026-02-21T09:50:32.9123518Z %32 = arith.muli %30, %c64_i32 : i32 2026-02-21T09:50:32.9123683Z %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9123892Z %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9124101Z %35 = arith.addi %33, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:32.9124305Z %36 = arith.addi %34, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:32.9124466Z %37 = arith.muli %31, %c64_i32 : i32 2026-02-21T09:50:32.9124626Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9124833Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9125038Z %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:32.9125241Z %41 = arith.addi %39, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:32.9125504Z %42 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9125746Z %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked2> 2026-02-21T09:50:32.9125933Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9126223Z %45 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1> 2026-02-21T09:50:32.9126490Z %46 = tt.broadcast %45 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9126705Z %47 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9126965Z %48 = 
tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9127368Z %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9127555Z %50 = arith.addi %44, %49 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9127745Z %51 = tt.addptr %10, %50 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9127943Z %52 = tt.load %51 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9128216Z %53 = ttg.memdesc_index %47[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9128565Z ttg.local_store %52, %53 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9128832Z %54 = arith.addi %9, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9129101Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9129368Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9129552Z %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9129745Z %58 = tt.addptr %10, %57 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9129945Z %59 = tt.load %58 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9130232Z %60 = ttg.memdesc_index %47[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9130575Z ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9131093Z %61:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %53, %arg8 = %60) -> 
(tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:32.9131554Z %116 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9131782Z %117 = arith.addi %116, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9131957Z %118 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:32.9132081Z %119 = arith.muli %118, %c2_i32 : i32 2026-02-21T09:50:32.9132251Z %120 = tt.splat %119 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9132470Z %121 = arith.addi %120, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:32.9132744Z %122 = tt.expand_dims %121 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:50:32.9133017Z %123 = tt.broadcast %122 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9133213Z %124 = arith.addi %44, %123 : tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9133412Z %125 = tt.addptr %10, %124 : tensor<64x4x!tt.ptr, #blocked2>, tensor<64x4xi32, #blocked2> 2026-02-21T09:50:32.9133618Z %126 = tt.load %125 : tensor<64x4x!tt.ptr, #blocked2> 2026-02-21T09:50:32.9133916Z %127 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9134363Z %128 = arith.extf %127 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9134743Z %129 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9135006Z %130 = arith.muli %129, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9135196Z %131 = 
tt.broadcast %130 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9135387Z %132 = arith.addi %131, %46 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9135579Z %133 = tt.addptr %11, %132 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9135780Z %134 = tt.load %133 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9136023Z %135 = ttg.convert_layout %134 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9136302Z %136 = arith.shli %135, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9136541Z %137 = arith.shrsi %136, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9136776Z %138 = arith.shrsi %135, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9137064Z %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9137396Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9137678Z %141 = tt.broadcast %139 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9137914Z %142 = arith.select %16, %141, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9138160Z %143 = tt.broadcast %140 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9138389Z %144 = arith.select %18, %143, %142 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9138634Z %145 = tt.reshape %144 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9138854Z %146 = arith.sitofp %145 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9139148Z %147 = ttg.convert_layout %146 : tensor<4x64xf32, 
#blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9139609Z %148 = tt.dot %128, %147, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9139955Z %149 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:32.9140085Z %150 = arith.cmpi slt, %149, %c2_i32 : i32 2026-02-21T09:50:32.9140217Z %151 = arith.select %150, %149, %c0_i32 : i32 2026-02-21T09:50:32.9140480Z %152 = ttg.memdesc_index %47[%151] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9140831Z ttg.local_store %126, %152 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9141226Z scf.yield %148, %151, %arg8, %152 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:32.9141533Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:32.9141706Z %62 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9142029Z %63 = ttg.local_load %61#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9142475Z %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9142848Z %65 = tt.expand_dims %62 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9143106Z %66 = arith.muli %65, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9143289Z %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked1> -> 
tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9143476Z %68 = arith.addi %67, %46 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9143665Z %69 = tt.addptr %11, %68 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9143853Z %70 = tt.load %69 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9144088Z %71 = ttg.convert_layout %70 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9144359Z %72 = arith.shli %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9144586Z %73 = arith.shrsi %72, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9144812Z %74 = arith.shrsi %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9145090Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9145414Z %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9145684Z %77 = tt.broadcast %75 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9145930Z %78 = arith.select %16, %77, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9146157Z %79 = tt.broadcast %76 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9146374Z %80 = arith.select %18, %79, %78 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9146607Z %81 = tt.reshape %80 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9146816Z %82 = arith.sitofp %81 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9147096Z %83 = ttg.convert_layout %82 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:50:32.9147543Z %84 = tt.dot %64, %83, %61#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9147922Z %85 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:32.9148245Z %86 = ttg.local_load %61#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9148661Z %87 = arith.extf %86 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9149034Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9149278Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:50:32.9149460Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9149646Z %91 = arith.addi %90, %46 : tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9149831Z %92 = tt.addptr %11, %91 : tensor<2x64x!tt.ptr, #blocked1>, tensor<2x64xi32, #blocked1> 2026-02-21T09:50:32.9150037Z %93 = tt.load %92 : tensor<2x64x!tt.ptr, #blocked1> 2026-02-21T09:50:32.9150268Z %94 = ttg.convert_layout %93 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9150537Z %95 = arith.shli %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9150775Z %96 = arith.shrsi %95, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9150999Z %97 = arith.shrsi %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:32.9151276Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x64xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9151603Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:32.9151876Z %100 = tt.broadcast %98 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9152112Z %101 = arith.select %16, %100, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9152345Z %102 = tt.broadcast %99 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9152573Z %103 = arith.select %18, %102, %101 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:32.9152807Z %104 = tt.reshape %103 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:32.9153025Z %105 = arith.sitofp %104 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:32.9153320Z %106 = ttg.convert_layout %105 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:32.9153792Z %107 = tt.dot %87, %106, %84, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:32.9154168Z ttg.local_dealloc %47 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:32.9154400Z %108 = arith.truncf %107 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:32.9154662Z %109 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9154905Z %110 = arith.muli %109, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:50:32.9155141Z %111 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:32.9155400Z %112 = 
tt.broadcast %110 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9155608Z %113 = tt.broadcast %111 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9155789Z %114 = arith.addi %112, %113 : tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9155984Z %115 = tt.addptr %19, %114 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:32.9156183Z tt.store %115, %108 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:32.9156324Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:32.9156435Z tt.return 2026-02-21T09:50:32.9156518Z } 2026-02-21T09:50:32.9156603Z } 2026-02-21T09:50:32.9156648Z 2026-02-21T09:50:32.9156682Z {-# 2026-02-21T09:50:32.9156768Z external_resources: { 2026-02-21T09:50:32.9156871Z mlir_reproducer: { 2026-02-21T09:50:32.9157892Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:50:32.9158886Z disable_threading: false, 2026-02-21T09:50:32.9159002Z verify_each: true 2026-02-21T09:50:32.9159095Z } 2026-02-21T09:50:32.9159176Z } 2026-02-21T09:50:32.9159263Z #-} 2026-02-21T09:50:32.9159546Z /tmp/torchinductor_root/v4/cv4k7rhb7nmh5ams6lqfgfcicgqzja7tlcrle4voteobtcjndy3s.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:50:32.9160252Z /tmp/torchinductor_root/v4/cv4k7rhb7nmh5ams6lqfgfcicgqzja7tlcrle4voteobtcjndy3s.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 
'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:50:32.9160798Z [363s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:50:32.9161571Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:50:32.9162269Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:50:32.9162438Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:50:33.8392427Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:50:33.8400948Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:50:33.8401308Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:33.8401722Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:33.8402027Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:50:33.8402311Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:50:33.8402593Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:50:33.8402780Z #smem = #ttg.shared_memory 2026-02-21T09:50:33.8403010Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:50:33.8403481Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:50:33.8403873Z %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8404051Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:33.8404231Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:33.8404413Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8404577Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:50:33.8404767Z %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8405020Z %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:50:33.8405273Z %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8405590Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8405771Z %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8405949Z %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8406121Z %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8406353Z %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8406531Z %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8406687Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:50:33.8406804Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:50:33.8406925Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:50:33.8407054Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:50:33.8407176Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:50:33.8407294Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:50:33.8407437Z %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2> 2026-02-21T09:50:33.8407617Z %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8407766Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:50:33.8407891Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:50:33.8408002Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:50:33.8408187Z %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8408379Z %0 = tt.get_program_id x : i32 2026-02-21T09:50:33.8408493Z %1 = arith.muli %0, %c4_i32 : i32 2026-02-21T09:50:33.8408611Z %2 = arith.addi %1, %c4_i32 : i32 2026-02-21T09:50:33.8408728Z %3 = arith.minsi %2, %c32768_i32 : i32 2026-02-21T09:50:33.8408932Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8409202Z %5 = 
tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8409490Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8409757Z %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8410037Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8410284Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8410483Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8410716Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8411055Z %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8411420Z %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8411770Z %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:50:33.8412183Z %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:50:33.8412584Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:33.8412834Z %17 = arith.cmpi eq, %16, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:33.8413034Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:50:33.8413235Z %19 = arith.cmpi 
eq, %16, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:50:33.8413445Z %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:50:33.8413653Z %21 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:33.8413815Z %22 = arith.subi %3, %1 : i32 2026-02-21T09:50:33.8413933Z %23 = arith.remsi %22, %c3_i32 : i32 2026-02-21T09:50:33.8414063Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:50:33.8414180Z %25 = arith.addi %1, %24 : i32 2026-02-21T09:50:33.8414308Z scf.for %arg3 = %1 to %25 step %c3_i32 : i32 { 2026-02-21T09:50:33.8414447Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:50:33.8414575Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:50:33.8414696Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:50:33.8414817Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:50:33.8414938Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:50:33.8415063Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:50:33.8415176Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:50:33.8415294Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:50:33.8415412Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:50:33.8415570Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8415783Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8415948Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:50:33.8416119Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8416328Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8416544Z %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8416756Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8417041Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 
1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8417295Z %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8417506Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8417682Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:50:33.8417853Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8418071Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8418350Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8418624Z %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8418824Z %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8419001Z %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8419166Z %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2> 2026-02-21T09:50:33.8419349Z %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8419586Z %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8419853Z %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8420116Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8420305Z %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8420497Z %58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8420694Z %59 = tt.load %58 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8421001Z %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> 
!ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8421348Z ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8421621Z %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8421910Z %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8422173Z %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8422358Z %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8422546Z %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8422743Z %66 = tt.load %65 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8423016Z %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8423360Z ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8423871Z %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:33.8424285Z %307 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:33.8424408Z %308 = arith.muli %307, %c2_i32 : i32 2026-02-21T09:50:33.8424579Z %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8424803Z %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8425094Z %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8425410Z %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8425602Z %313 = arith.addi %44, %312 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8425804Z %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8426010Z %315 = tt.load %314 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8426313Z %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8426748Z %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8427036Z %318 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:50:33.8427212Z %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8427437Z %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8427716Z %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8427967Z %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8428158Z %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8428351Z %324 = arith.addi %323, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8428545Z %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8428755Z %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8428945Z %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8429110Z %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2> 
2026-02-21T09:50:33.8429302Z %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8429504Z %330 = arith.andi %329, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8429671Z %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8429926Z %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8430208Z %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8430450Z %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8430691Z %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8430980Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8431314Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8431597Z %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8431834Z %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8432068Z %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8432300Z %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8432542Z %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8432764Z %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8433056Z %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, 
kWidth = 2}>> 2026-02-21T09:50:33.8433543Z %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8433892Z %346 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:33.8434023Z %347 = arith.cmpi slt, %346, %c2_i32 : i32 2026-02-21T09:50:33.8434154Z %348 = arith.select %347, %346, %c0_i32 : i32 2026-02-21T09:50:33.8434417Z %349 = ttg.memdesc_index %54[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8434764Z ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8435150Z scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8435454Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:33.8435727Z %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8436150Z %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8436479Z %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8436771Z %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8437012Z %73 = arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8437197Z %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8437386Z %75 = arith.addi %74, 
%49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8437588Z %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8437790Z %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8437957Z %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8438113Z %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2> 2026-02-21T09:50:33.8438294Z %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8438478Z %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8438643Z %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8438892Z %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8439164Z %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8439396Z %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8439623Z %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8439903Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8440231Z %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8440518Z %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8440748Z %90 = arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8440992Z %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8441212Z %92 = arith.select %20, %91, %90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, 
#blocked> 2026-02-21T09:50:33.8441429Z %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8441637Z %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8441918Z %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8442367Z %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8442902Z %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8443322Z %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8443643Z %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8443920Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8444168Z %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8444359Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8444552Z %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8444769Z %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8444977Z %105 = arith.cmpi sge, %100, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8445147Z %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8445327Z %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 
2026-02-21T09:50:33.8445513Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8445699Z %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8445866Z %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8446118Z %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8446399Z %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8446637Z %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8446871Z %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8447156Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8447491Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8447769Z %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8448005Z %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8448237Z %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8448485Z %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8448708Z %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8448947Z %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8449240Z %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, 
kWidth = 2}>> 2026-02-21T09:50:33.8449695Z %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8450072Z ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8450282Z %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:33.8450544Z %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8450784Z %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8451047Z %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:33.8451303Z %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8451504Z %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8451678Z %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8451866Z %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8452057Z tt.store %132, %125 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:33.8452201Z %133 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:50:33.8452324Z %134 = arith.divsi %133, %c512_i32 : i32 2026-02-21T09:50:33.8452462Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:50:33.8452582Z %136 = arith.subi %c128_i32, %135 : i32 2026-02-21T09:50:33.8452700Z %137 = arith.minsi %136, %c2_i32 : i32 2026-02-21T09:50:33.8452818Z %138 = arith.remsi %133, %c512_i32 : i32 2026-02-21T09:50:33.8452947Z %139 = arith.remsi %138, %137 : i32 2026-02-21T09:50:33.8453063Z %140 = arith.addi %135, %139 : i32 2026-02-21T09:50:33.8453174Z %141 = arith.divsi %138, %137 : i32 2026-02-21T09:50:33.8453292Z %142 = 
arith.muli %140, %c64_i32 : i32 2026-02-21T09:50:33.8453449Z %143 = tt.splat %142 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8453657Z %144 = arith.addi %143, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8453824Z %145 = arith.muli %141, %c64_i32 : i32 2026-02-21T09:50:33.8453990Z %146 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8454203Z %147 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8454415Z %148 = arith.addi %146, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8454630Z %149 = arith.addi %147, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8454899Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8455148Z %151 = arith.muli %150, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8455342Z %152 = tt.broadcast %151 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8455514Z %153 = arith.extsi %142 : i32 to i64 2026-02-21T09:50:33.8455680Z %154 = tt.splat %153 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8455925Z %155 = arith.addi %154, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8456198Z %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8456494Z %157 = tt.broadcast %156 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8456694Z %158 = arith.cmpi sge, %156, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8456870Z %159 = arith.cmpi slt, %156, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8457036Z %160 = arith.andi %158, %159 : tensor<1x64xi1, 
#blocked2> 2026-02-21T09:50:33.8457225Z %161 = tt.broadcast %160 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8457441Z %162 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8457626Z %163 = arith.addi %152, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8457828Z %164 = tt.addptr %9, %163 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8458032Z %165 = tt.load %164 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8458314Z %166 = ttg.memdesc_index %162[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8458671Z ttg.local_store %165, %166 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8458905Z %167 = arith.addi %152, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8459101Z %168 = tt.addptr %9, %167 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8459300Z %169 = tt.load %168 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8459576Z %170 = ttg.memdesc_index %162[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8459944Z ttg.local_store %169, %170 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8460457Z %171:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %166, %arg8 = %170) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:33.8460888Z %307 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:33.8461014Z %308 = arith.muli %307, %c2_i32 : i32 2026-02-21T09:50:33.8461186Z %309 = tt.splat %308 : i32 -> tensor<4xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8461412Z %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8461689Z %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8461968Z %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8462164Z %313 = arith.addi %152, %312 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8462361Z %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8462568Z %315 = tt.load %314 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8462863Z %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8463296Z %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8463578Z %318 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:50:33.8463772Z %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8463996Z %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8464284Z %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8464534Z %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8464726Z %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8464918Z %324 = arith.addi %323, %157 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8465113Z %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 
2026-02-21T09:50:33.8465316Z %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8465487Z %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8465649Z %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2> 2026-02-21T09:50:33.8465844Z %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8466038Z %330 = arith.andi %329, %161 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8466204Z %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8466462Z %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8466741Z %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8466982Z %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8467221Z %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8467526Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8467859Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8468141Z %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8468392Z %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8468627Z %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8476411Z %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8476668Z %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, 
#blocked3> 2026-02-21T09:50:33.8476890Z %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8477188Z %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8477660Z %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8478011Z %346 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:33.8478139Z %347 = arith.cmpi slt, %346, %c2_i32 : i32 2026-02-21T09:50:33.8478273Z %348 = arith.select %347, %346, %c0_i32 : i32 2026-02-21T09:50:33.8478538Z %349 = ttg.memdesc_index %162[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8478890Z ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8479321Z scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8479634Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:33.8479909Z %172 = ttg.local_load %171#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8480337Z %173 = arith.extf %172 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8480634Z %174 = arith.addi %74, %157 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8480834Z %175 = tt.addptr %10, %174 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8481033Z %176 = 
arith.andi %80, %161 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8481204Z %177 = tt.load %175, %176, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8481466Z %178 = ttg.convert_layout %177 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8481748Z %179 = arith.shli %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8481984Z %180 = arith.shrsi %179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8482216Z %181 = arith.shrsi %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8482505Z %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8482887Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8483167Z %184 = tt.broadcast %182 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8483428Z %185 = arith.select %18, %184, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8483664Z %186 = tt.broadcast %183 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8483892Z %187 = arith.select %20, %186, %185 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8484138Z %188 = tt.reshape %187 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8484355Z %189 = arith.sitofp %188 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8484643Z %190 = ttg.convert_layout %189 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8485106Z %191 = tt.dot %173, %190, %171#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8485597Z %192 = ttg.local_load %171#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8486020Z %193 = arith.extf %192 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8486314Z %194 = arith.addi %102, %157 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8486512Z %195 = tt.addptr %10, %194 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8486712Z %196 = arith.andi %108, %161 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8486877Z %197 = tt.load %195, %196, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8487145Z %198 = ttg.convert_layout %197 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8487421Z %199 = arith.shli %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8487803Z %200 = arith.shrsi %199, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8488036Z %201 = arith.shrsi %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8488317Z %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8488646Z %203 = tt.expand_dims %201 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8488923Z %204 = tt.broadcast %202 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8489157Z %205 = arith.select %18, %204, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T09:50:33.8489394Z %206 = tt.broadcast %203 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8489619Z %207 = arith.select %20, %206, %205 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8489846Z %208 = tt.reshape %207 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8490062Z %209 = arith.sitofp %208 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8490350Z %210 = ttg.convert_layout %209 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8490810Z %211 = tt.dot %193, %210, %191, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8491190Z ttg.local_dealloc %162 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8491419Z %212 = arith.truncf %211 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:33.8491676Z %213 = tt.expand_dims %149 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8491915Z %214 = arith.muli %213, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8492156Z %215 = tt.expand_dims %144 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:33.8492409Z %216 = tt.broadcast %214 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8492608Z %217 = tt.broadcast %215 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8492784Z %218 = arith.addi %216, %217 : tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8492969Z %219 = tt.addptr %21, %218 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8493163Z tt.store %219, %212 : tensor<64x64x!tt.ptr, #mma> 
2026-02-21T09:50:33.8493304Z %220 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:50:33.8493429Z %221 = arith.divsi %220, %c512_i32 : i32 2026-02-21T09:50:33.8493549Z %222 = arith.muli %221, %c2_i32 : i32 2026-02-21T09:50:33.8493669Z %223 = arith.subi %c128_i32, %222 : i32 2026-02-21T09:50:33.8493785Z %224 = arith.minsi %223, %c2_i32 : i32 2026-02-21T09:50:33.8493904Z %225 = arith.remsi %220, %c512_i32 : i32 2026-02-21T09:50:33.8494019Z %226 = arith.remsi %225, %224 : i32 2026-02-21T09:50:33.8494132Z %227 = arith.addi %222, %226 : i32 2026-02-21T09:50:33.8494245Z %228 = arith.divsi %225, %224 : i32 2026-02-21T09:50:33.8494358Z %229 = arith.muli %227, %c64_i32 : i32 2026-02-21T09:50:33.8494518Z %230 = tt.splat %229 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8494724Z %231 = arith.addi %230, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8494904Z %232 = arith.muli %228, %c64_i32 : i32 2026-02-21T09:50:33.8495072Z %233 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8495300Z %234 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8495513Z %235 = arith.addi %233, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8495725Z %236 = arith.addi %234, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8495994Z %237 = tt.expand_dims %235 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8496251Z %238 = arith.muli %237, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8496445Z %239 = tt.broadcast %238 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8496620Z %240 = arith.extsi %229 : i32 to i64 2026-02-21T09:50:33.8496787Z %241 = tt.splat %240 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T09:50:33.8497006Z %242 = arith.addi %241, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8497282Z %243 = tt.expand_dims %242 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8497562Z %244 = tt.broadcast %243 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8497766Z %245 = arith.cmpi sge, %243, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8497939Z %246 = arith.cmpi slt, %243, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8498107Z %247 = arith.andi %245, %246 : tensor<1x64xi1, #blocked2> 2026-02-21T09:50:33.8498292Z %248 = tt.broadcast %247 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8498506Z %249 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8498707Z %250 = arith.addi %239, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8498902Z %251 = tt.addptr %9, %250 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8499108Z %252 = tt.load %251 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8499399Z %253 = ttg.memdesc_index %249[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8499754Z ttg.local_store %252, %253 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8499992Z %254 = arith.addi %239, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8500183Z %255 = tt.addptr %9, %254 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8500384Z %256 = tt.load %255 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8500659Z %257 = ttg.memdesc_index %249[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 
2026-02-21T09:50:33.8501008Z ttg.local_store %256, %257 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8501525Z %258:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %253, %arg8 = %257) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:33.8501943Z %307 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:33.8502068Z %308 = arith.muli %307, %c2_i32 : i32 2026-02-21T09:50:33.8502238Z %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8502464Z %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8502756Z %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8503046Z %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8503240Z %313 = arith.addi %239, %312 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8503437Z %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8503642Z %315 = tt.load %314 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8503938Z %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8504363Z %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8504642Z %318 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:50:33.8504812Z %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8505041Z %320 
= arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8505321Z %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8505567Z %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8505759Z %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8505949Z %324 = arith.addi %323, %244 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8506146Z %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8506353Z %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8506545Z %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8506709Z %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2> 2026-02-21T09:50:33.8506895Z %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8507096Z %330 = arith.andi %329, %248 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8507264Z %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8507516Z %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8507795Z %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8508030Z %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8508267Z %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8508555Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8508886Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8509168Z %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8509403Z %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8509640Z %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8509870Z %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8510094Z %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8510329Z %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8510619Z %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8511097Z %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8511443Z %346 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:33.8511569Z %347 = arith.cmpi slt, %346, %c2_i32 : i32 2026-02-21T09:50:33.8511703Z %348 = arith.select %347, %346, %c0_i32 : i32 2026-02-21T09:50:33.8511964Z %349 = ttg.memdesc_index %249[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8512316Z ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8512704Z scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8512999Z } {tt.flatten, 
tt.num_stages = 3 : i32} 2026-02-21T09:50:33.8513275Z %259 = ttg.local_load %258#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8513698Z %260 = arith.extf %259 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8513990Z %261 = arith.addi %74, %244 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8514187Z %262 = tt.addptr %10, %261 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8514400Z %263 = arith.andi %80, %248 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8514567Z %264 = tt.load %262, %263, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8514822Z %265 = ttg.convert_layout %264 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8515110Z %266 = arith.shli %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8515342Z %267 = arith.shrsi %266, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8515574Z %268 = arith.shrsi %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8515858Z %269 = tt.expand_dims %267 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8516188Z %270 = tt.expand_dims %268 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8516463Z %271 = tt.broadcast %269 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8516698Z %272 = arith.select %18, %271, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8516930Z %273 = tt.broadcast %270 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, 
#blocked> 2026-02-21T09:50:33.8517160Z %274 = arith.select %20, %273, %272 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8517386Z %275 = tt.reshape %274 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8517604Z %276 = arith.sitofp %275 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8517893Z %277 = ttg.convert_layout %276 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8518363Z %278 = tt.dot %260, %277, %258#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8518865Z %279 = ttg.local_load %258#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8519287Z %280 = arith.extf %279 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8519578Z %281 = arith.addi %102, %244 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8519774Z %282 = tt.addptr %10, %281 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8519975Z %283 = arith.andi %108, %248 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8520141Z %284 = tt.load %282, %283, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8520398Z %285 = ttg.convert_layout %284 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8520676Z %286 = arith.shli %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8520910Z %287 = arith.shrsi %286, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8521144Z %288 = 
arith.shrsi %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8521426Z %289 = tt.expand_dims %287 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8521758Z %290 = tt.expand_dims %288 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8522033Z %291 = tt.broadcast %289 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8522285Z %292 = arith.select %18, %291, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8522518Z %293 = tt.broadcast %290 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8522779Z %294 = arith.select %20, %293, %292 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8523022Z %295 = tt.reshape %294 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8523235Z %296 = arith.sitofp %295 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8523526Z %297 = ttg.convert_layout %296 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8523978Z %298 = tt.dot %280, %297, %278, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8524354Z ttg.local_dealloc %249 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8524562Z %299 = arith.truncf %298 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:33.8524822Z %300 = tt.expand_dims %236 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8525052Z %301 = arith.muli %300, %cst : tensor<64x1xi32, #mma> 
2026-02-21T09:50:33.8525276Z %302 = tt.expand_dims %231 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:33.8525525Z %303 = tt.broadcast %301 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8525723Z %304 = tt.broadcast %302 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8525896Z %305 = arith.addi %303, %304 : tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8526096Z %306 = tt.addptr %21, %305 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8526287Z tt.store %306, %299 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:33.8526422Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:33.8526561Z scf.for %arg3 = %25 to %3 step %c1_i32 : i32 { 2026-02-21T09:50:33.8526690Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:50:33.8526808Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:50:33.8526923Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:50:33.8527037Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:50:33.8527154Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:50:33.8527269Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:50:33.8527379Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:50:33.8527485Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:50:33.8527594Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:50:33.8527748Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8527951Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:50:33.8528113Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:50:33.8528276Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:50:33.8528484Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8528689Z %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:50:33.8528892Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:50:33.8529153Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8529399Z %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:50:33.8529587Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8529773Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:50:33.8529932Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8530144Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:50:33.8530430Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8530699Z %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8530889Z %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8531055Z %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:50:33.8531211Z %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2> 2026-02-21T09:50:33.8531386Z %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8531595Z %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8531854Z %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8532117Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8532301Z %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8532489Z %58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, 
#blocked1> 2026-02-21T09:50:33.8532684Z %59 = tt.load %58 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8532956Z %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8533318Z ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8533581Z %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8533872Z %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8534139Z %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8534320Z %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8534509Z %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8534700Z %66 = tt.load %65 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8534971Z %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8535315Z ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8535817Z %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:50:33.8536231Z %133 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:50:33.8536352Z %134 = arith.muli %133, %c2_i32 : i32 2026-02-21T09:50:33.8536522Z %135 = tt.splat %134 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:50:33.8536743Z %136 = arith.addi %135, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:50:33.8537013Z %137 = tt.expand_dims %136 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:50:33.8537286Z %138 = tt.broadcast %137 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8537491Z %139 = arith.addi %44, %138 : tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8537687Z %140 = tt.addptr %9, %139 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:50:33.8537888Z %141 = tt.load %140 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:50:33.8538192Z %142 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8538622Z %143 = arith.extf %142 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8538898Z %144 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:50:33.8539068Z %145 = tt.splat %144 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8539294Z %146 = arith.addi %145, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8539568Z %147 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8539815Z %148 = arith.muli %147, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8540004Z %149 = tt.broadcast %148 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8540195Z %150 = arith.addi %149, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8540388Z %151 = tt.addptr %10, %150 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8540594Z %152 = arith.cmpi sge, %147, 
%cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8540762Z %153 = arith.cmpi slt, %147, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8540921Z %154 = arith.andi %152, %153 : tensor<2x1xi1, #blocked2> 2026-02-21T09:50:33.8541122Z %155 = tt.broadcast %154 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8541307Z %156 = arith.andi %155, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8541484Z %157 = tt.load %151, %156, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8541736Z %158 = ttg.convert_layout %157 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8542012Z %159 = arith.shli %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8542247Z %160 = arith.shrsi %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8542481Z %161 = arith.shrsi %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8542766Z %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8543096Z %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8543374Z %164 = tt.broadcast %162 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8543610Z %165 = arith.select %18, %164, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8543846Z %166 = tt.broadcast %163 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8544072Z %167 = arith.select %20, %166, %165 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8544297Z %168 = tt.reshape %167 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8544513Z %169 = arith.sitofp 
%168 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8544819Z %170 = ttg.convert_layout %169 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8545276Z %171 = tt.dot %143, %170, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8545629Z %172 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:50:33.8545754Z %173 = arith.cmpi slt, %172, %c2_i32 : i32 2026-02-21T09:50:33.8545883Z %174 = arith.select %173, %172, %c0_i32 : i32 2026-02-21T09:50:33.8546144Z %175 = ttg.memdesc_index %54[%174] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8546490Z ttg.local_store %141, %175 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8546875Z scf.yield %171, %174, %arg8, %175 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:50:33.8547166Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:50:33.8547433Z %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8547849Z %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8548169Z %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8548438Z %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8548676Z %73 = 
arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8548876Z %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8549057Z %75 = arith.addi %74, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8549242Z %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8549453Z %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8549620Z %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8549775Z %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2> 2026-02-21T09:50:33.8549950Z %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8550130Z %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8550285Z %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8550530Z %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8550803Z %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8551029Z %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8551256Z %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8551532Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8551854Z %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8552121Z %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8552348Z %90 = arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T09:50:33.8552574Z %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8552807Z %92 = arith.select %20, %91, %90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8553021Z %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8553227Z %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8553518Z %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8553966Z %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8554443Z %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8554857Z %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8555179Z %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:50:33.8555454Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8555698Z %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8555883Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8556071Z %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8556263Z %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:50:33.8556481Z %105 = arith.cmpi sge, %100, 
%cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8556652Z %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:50:33.8556811Z %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 2026-02-21T09:50:33.8557008Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8557192Z %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:50:33.8557355Z %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:50:33.8557606Z %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8557879Z %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8558110Z %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8558342Z %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:50:33.8558624Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8558954Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:50:33.8559228Z %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8559460Z %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8559690Z %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8559913Z %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:50:33.8560134Z %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:50:33.8560363Z %122 = arith.sitofp 
%121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:50:33.8560649Z %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:50:33.8561102Z %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:50:33.8561486Z ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:50:33.8561692Z %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:50:33.8561947Z %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8562179Z %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:50:33.8562405Z %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:50:33.8562700Z %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8562901Z %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8563075Z %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8563256Z %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:50:33.8563446Z tt.store %132, %125 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:50:33.8563579Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:33.8563680Z tt.return 2026-02-21T09:50:33.8563757Z } 2026-02-21T09:50:33.8563830Z } 2026-02-21T09:50:33.8563872Z 2026-02-21T09:50:33.8563902Z {-# 2026-02-21T09:50:33.8563980Z external_resources: { 2026-02-21T09:50:33.8564075Z mlir_reproducer: { 2026-02-21T09:50:33.8565097Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, 
convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:50:33.8566112Z disable_threading: false, 2026-02-21T09:50:33.8566213Z verify_each: true 2026-02-21T09:50:33.8566300Z } 2026-02-21T09:50:33.8566369Z } 2026-02-21T09:50:33.8566433Z #-} 2026-02-21T09:50:33.8566708Z /tmp/torchinductor_root/tt/cttppwz2klxjgyaagf3tfl4fnvqsgrdnel4y4bu34rosohpmvfbm.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:50:33.8567385Z /tmp/torchinductor_root/tt/cttppwz2klxjgyaagf3tfl4fnvqsgrdnel4y4bu34rosohpmvfbm.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:50:33.8567931Z [364s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:50:33.8568696Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:50:33.8569388Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:50:33.8569570Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:50:36.1365615Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 98/98 8.1 configs/s 2026-02-21T09:50:45.2605455Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 179/179 15.4 configs/s 2026-02-21T09:50:48.5258592Z [379s] Generation 3 complete: 2026-02-21T09:50:48.5259027Z error=15 2026-02-21T09:50:48.5259238Z ok=88 2026-02-21T09:50:48.5259445Z min=1.1053 2026-02-21T09:50:48.5259646Z mid=1.6796 2026-02-21T09:50:48.5259850Z max=103.7485 2026-02-21T09:50:48.5260095Z best={'block_sizes': [8, 128, 128], 2026-02-21T09:50:48.5260482Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T09:50:48.5260847Z 'l2_groupings': [8], 2026-02-21T09:50:48.5261121Z 'load_eviction_policies': ['', ''], 2026-02-21T09:50:48.5261430Z 'loop_orders': [[0, 1]], 2026-02-21T09:50:48.5261709Z 'matrix_instr_nonkdim': 0, 2026-02-21T09:50:48.5261984Z 'num_sm_multiplier': 32, 2026-02-21T09:50:48.5262292Z 'num_stages': 4, 2026-02-21T09:50:48.5262543Z 'num_warps': 2, 2026-02-21T09:50:48.5262814Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:50:48.5263174Z 'range_flattens': [False, True], 2026-02-21T09:50:48.5263861Z 'range_multi_buffers': [True, None], 2026-02-21T09:50:48.5264124Z 'range_num_stages': [2, 3], 2026-02-21T09:50:48.5264366Z 'range_unroll_factors': [4, 0], 
2026-02-21T09:50:48.5264621Z 'range_warp_specializes': [], 2026-02-21T09:50:48.5264848Z 'waves_per_eu': 2} 2026-02-21T09:50:48.5332196Z [379s] Fitting surrogate: 425 points, 425 targets 2026-02-21T09:50:50.2527257Z [381s] Generation 4 starting: 77 neighbors, 4 active search path(s) 2026-02-21T09:51:04.1782796Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 1.4 configs/s 2026-02-21T09:51:04.4222577Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:51:04.4230965Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:51:04.4231336Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:51:04.4231642Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T09:51:04.4231935Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:51:04.4232193Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:04.4232429Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:04.4232615Z #smem = #ttg.shared_memory 2026-02-21T09:51:04.4232849Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:51:04.4233445Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:04.4233854Z %cst = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 
2026-02-21T09:51:04.4234024Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:04.4234149Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:04.4234296Z %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4234459Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:04.4234585Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:51:04.4234703Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:51:04.4234825Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:51:04.4234939Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:51:04.4235173Z %cst_1 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4235471Z %cst_2 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4235737Z %cst_3 = arith.constant dense<0> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4236059Z %cst_4 = arith.constant dense<8192> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4236314Z %cst_5 = arith.constant dense<0> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4236538Z %cst_6 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:04.4236752Z %cst_7 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4236974Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:04.4237151Z %cst_9 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:04.4237371Z %cst_10 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4237631Z %cst_11 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4237963Z %cst_12 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4238184Z 
%cst_13 = arith.constant dense<0> : tensor<1x128xi64, #mma> 2026-02-21T09:51:04.4238356Z %cst_14 = arith.constant dense<8192> : tensor<1x128xi64, #mma> 2026-02-21T09:51:04.4238532Z %cst_15 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:51:04.4238703Z %cst_16 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:51:04.4238874Z %cst_17 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:51:04.4239029Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:04.4239144Z %1 = arith.remsi %0, %c64_i32 : i32 2026-02-21T09:51:04.4239284Z %2 = arith.divsi %0, %c64_i32 : i32 2026-02-21T09:51:04.4239399Z %3 = arith.muli %1, %c128_i32 : i32 2026-02-21T09:51:04.4239516Z %4 = arith.muli %2, %c128_i32 : i32 2026-02-21T09:51:04.4239721Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:04.4240000Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:04.4240331Z %7 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4240644Z %8 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:04.4240894Z %9 = tt.splat %4 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:04.4241115Z %10 = arith.addi %9, %5 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:04.4241389Z %11 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4241693Z %12 = tt.expand_dims %10 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:51:04.4241948Z %13 = arith.muli %12, %cst_6 : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:04.4242147Z %14 = 
tt.broadcast %13 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4242365Z %15 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:04.4242534Z %16 = arith.extsi %3 : i32 to i64 2026-02-21T09:51:04.4242808Z %17 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4243123Z %18 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4243561Z %19 = arith.extsi %18 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4243964Z %20 = tt.splat %16 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4244392Z %21 = arith.extsi %7 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4244800Z %22 = arith.addi %20, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4245188Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4245623Z %24 = tt.broadcast %23 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4245937Z %25 = arith.cmpi sge, %23, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4246207Z %26 = arith.cmpi slt, %23, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4246448Z %27 = arith.andi 
%25, %26 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4246747Z %28 = tt.broadcast %27 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4247112Z %29 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:51:04.4247544Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:51:04.4247939Z %31 = tt.expand_dims %30 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:04.4248197Z %32 = arith.cmpi eq, %31, %cst_8 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:04.4248395Z %33 = tt.broadcast %32 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:51:04.4248594Z %34 = arith.cmpi eq, %31, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:04.4248790Z %35 = tt.broadcast %34 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:51:04.4249054Z %36 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg4 = %cst) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:51:04.4249276Z %97 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:51:04.4249448Z %98 = tt.splat %97 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4249693Z %99 = arith.addi %98, %11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4249966Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:04.4250249Z %101 = tt.broadcast %100 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4250453Z %102 = arith.addi %14, %101 : 
tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4250659Z %103 = tt.addptr %15, %102 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4250875Z %104 = tt.load %103 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:04.4251100Z %105 = ttg.local_alloc %104 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:04.4251441Z %106 = ttg.local_load %105 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4251860Z %107 = arith.extf %106 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4252145Z %108 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:51:04.4252362Z %109 = tt.splat %108 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4252679Z %110 = arith.addi %109, %19 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4253065Z %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4253426Z %112 = arith.muli %111, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4253739Z %113 = tt.broadcast %112 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4254049Z %114 = arith.addi %113, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4254390Z %115 = tt.addptr %17, %114 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4254712Z %116 = arith.cmpi sge, %111, %cst_11 : 
tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4254962Z %117 = arith.cmpi slt, %111, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4255205Z %118 = arith.andi %116, %117 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4255509Z %119 = tt.broadcast %118 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4255847Z %120 = arith.andi %119, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4256091Z %121 = tt.load %115, %120, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4256345Z %122 = arith.shli %121, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4256584Z %123 = arith.shrsi %122, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4256822Z %124 = arith.shrsi %121, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4257118Z %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4257457Z %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4257768Z %127 = tt.broadcast %125 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4258017Z %128 = arith.select %33, %127, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4258257Z %129 = tt.broadcast %126 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4258501Z %130 = arith.select %35, %129, %128 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4258737Z %131 = tt.reshape %130 : tensor<2x2x128xi8, #blocked> -> 
tensor<4x128xi8, #blocked2> 2026-02-21T09:51:04.4258967Z %132 = arith.sitofp %131 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:04.4259233Z %133 = ttg.local_alloc %132 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:04.4259560Z %134 = ttg.local_load %133 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4260048Z %135 = tt.dot %107, %134, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:04.4260410Z %136 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:51:04.4260560Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:51:04.4260735Z %138 = tt.splat %137 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4260959Z %139 = arith.addi %138, %11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4261241Z %140 = tt.expand_dims %139 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:04.4261523Z %141 = tt.broadcast %140 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4261725Z %142 = arith.addi %14, %141 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4261937Z %143 = tt.addptr %15, %142 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4262147Z %144 = tt.load %143 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:04.4262394Z %145 = ttg.local_alloc %144 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:04.4262724Z %146 = ttg.local_load %145 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4263154Z %147 = 
arith.extf %146 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4263438Z %148 = arith.extsi %136 : i32 to i64 2026-02-21T09:51:04.4263644Z %149 = tt.splat %148 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4263957Z %150 = arith.addi %149, %19 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4264342Z %151 = tt.expand_dims %150 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4264693Z %152 = arith.muli %151, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4265002Z %153 = tt.broadcast %152 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4265308Z %154 = arith.addi %153, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4265619Z %155 = tt.addptr %17, %154 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4265961Z %156 = arith.cmpi sge, %151, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4266203Z %157 = arith.cmpi slt, %151, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4266439Z %158 = arith.andi %156, %157 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4266741Z %159 = tt.broadcast %158 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4267039Z %160 = arith.andi %159, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:51:04.4267283Z %161 = tt.load %155, %160, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4267531Z %162 = arith.shli %161, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4267766Z %163 = arith.shrsi %162, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4268005Z %164 = arith.shrsi %161, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4268297Z %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4268652Z %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4268939Z %167 = tt.broadcast %165 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4269180Z %168 = arith.select %33, %167, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4269422Z %169 = tt.broadcast %166 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4269656Z %170 = arith.select %35, %169, %168 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4269891Z %171 = tt.reshape %170 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:04.4270119Z %172 = arith.sitofp %171 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:04.4270391Z %173 = ttg.local_alloc %172 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:04.4270722Z %174 = ttg.local_load %173 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4271190Z %175 = tt.dot %147, %174, %135, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:04.4271535Z %176 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:51:04.4271659Z %177 = arith.muli %176, %c2_i32 : i32 2026-02-21T09:51:04.4271841Z %178 = tt.splat %177 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4272065Z %179 = arith.addi %178, %11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4272337Z %180 = tt.expand_dims %179 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:04.4272614Z %181 = tt.broadcast %180 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4272809Z %182 = arith.addi %14, %181 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4273009Z %183 = tt.addptr %15, %182 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4273215Z %184 = tt.load %183 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:04.4273434Z %185 = ttg.local_alloc %184 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:04.4273783Z %186 = ttg.local_load %185 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4274189Z %187 = arith.extf %186 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4274468Z %188 = arith.extsi %176 : i32 to i64 2026-02-21T09:51:04.4274677Z %189 = tt.splat %188 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4274972Z %190 = arith.addi %189, %19 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4275358Z %191 = tt.expand_dims %190 {axis = 1 : 
i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4275714Z %192 = arith.muli %191, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4276019Z %193 = tt.broadcast %192 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4276326Z %194 = arith.addi %193, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4276638Z %195 = tt.addptr %17, %194 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4276970Z %196 = arith.cmpi sge, %191, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4277215Z %197 = arith.cmpi slt, %191, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4277445Z %198 = arith.andi %196, %197 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4277746Z %199 = tt.broadcast %198 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4278048Z %200 = arith.andi %199, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4278305Z %201 = tt.load %195, %200, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4278551Z %202 = arith.shli %201, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4278782Z %203 = arith.shrsi %202, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4279019Z %204 = arith.shrsi %201, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4279307Z %205 = tt.expand_dims %203 {axis = 1 : i32} : 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4279642Z %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4279946Z %207 = tt.broadcast %205 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4280184Z %208 = arith.select %33, %207, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4280425Z %209 = tt.broadcast %206 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4280660Z %210 = arith.select %35, %209, %208 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4280889Z %211 = tt.reshape %210 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:04.4281116Z %212 = arith.sitofp %211 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:04.4281369Z %213 = ttg.local_alloc %212 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:04.4281697Z %214 = ttg.local_load %213 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4282183Z %215 = tt.dot %187, %214, %175, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:04.4282535Z scf.yield %215 : tensor<128x128xf32, #mma> 2026-02-21T09:51:04.4282704Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:04.4282867Z %37 = arith.addi %11, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:04.4283141Z %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:04.4283411Z %39 = tt.broadcast %38 : 
tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4283600Z %40 = arith.addi %14, %39 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4283796Z %41 = tt.addptr %15, %40 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:04.4283996Z %42 = tt.load %41 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:04.4284212Z %43 = ttg.local_alloc %42 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:04.4284536Z %44 = ttg.local_load %43 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4286676Z %45 = arith.extf %44 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4287043Z %46 = arith.addi %19, %cst_1 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:04.4287425Z %47 = tt.expand_dims %46 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4287784Z %48 = arith.muli %47, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4288086Z %49 = tt.broadcast %48 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4288408Z %50 = arith.addi %49, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4288711Z %51 = tt.addptr %17, %50 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4289043Z %52 = arith.cmpi sge, %47, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4289284Z %53 = arith.cmpi slt, %47, %cst_12 
: tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4289510Z %54 = arith.andi %52, %53 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4289820Z %55 = tt.broadcast %54 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4290112Z %56 = arith.andi %55, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4290341Z %57 = tt.load %51, %56, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4290577Z %58 = arith.shli %57, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4290805Z %59 = arith.shrsi %58, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4291031Z %60 = arith.shrsi %57, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:04.4291311Z %61 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4291661Z %62 = tt.expand_dims %60 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:04.4291939Z %63 = tt.broadcast %61 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4292175Z %64 = arith.select %33, %63, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4292406Z %65 = tt.broadcast %62 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4292631Z %66 = arith.select %35, %65, %64 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:04.4292855Z %67 = tt.reshape %66 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:04.4293073Z %68 = arith.sitofp %67 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 
2026-02-21T09:51:04.4293318Z %69 = ttg.local_alloc %68 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:04.4293635Z %70 = ttg.local_load %69 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:04.4294098Z %71 = tt.dot %45, %70, %36, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:04.4294485Z %72 = arith.truncf %71 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:51:04.4294724Z %73 = arith.extsi %4 : i32 to i64 2026-02-21T09:51:04.4294884Z %74 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:04.4295086Z %75 = tt.splat %73 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:04.4295358Z %76 = arith.extsi %6 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:04.4295691Z %77 = arith.extsi %8 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:04.4295957Z %78 = arith.addi %75, %76 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:04.4296235Z %79 = tt.expand_dims %78 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:51:04.4296468Z %80 = arith.muli %79, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:51:04.4296648Z %81 = tt.broadcast %80 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:04.4296851Z %82 = tt.splat %16 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:04.4297050Z %83 = arith.addi %82, %77 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:04.4297307Z %84 = tt.expand_dims %83 {axis 
= 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:51:04.4297575Z %85 = tt.broadcast %84 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:04.4297754Z %86 = arith.addi %81, %85 : tensor<128x128xi64, #mma> 2026-02-21T09:51:04.4297940Z %87 = tt.addptr %74, %86 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:51:04.4298140Z %88 = arith.cmpi sge, %79, %cst_16 : tensor<128x1xi64, #mma> 2026-02-21T09:51:04.4298303Z %89 = arith.cmpi slt, %79, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:51:04.4298454Z %90 = arith.andi %88, %89 : tensor<128x1xi1, #mma> 2026-02-21T09:51:04.4298624Z %91 = tt.broadcast %90 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:04.4298801Z %92 = arith.cmpi sge, %84, %cst_13 : tensor<1x128xi64, #mma> 2026-02-21T09:51:04.4298960Z %93 = arith.cmpi slt, %84, %cst_14 : tensor<1x128xi64, #mma> 2026-02-21T09:51:04.4299112Z %94 = arith.andi %92, %93 : tensor<1x128xi1, #mma> 2026-02-21T09:51:04.4299276Z %95 = tt.broadcast %94 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:04.4299451Z %96 = arith.andi %91, %95 : tensor<128x128xi1, #mma> 2026-02-21T09:51:04.4299619Z tt.store %87, %72, %96 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:04.4299756Z tt.return 2026-02-21T09:51:04.4299835Z } 2026-02-21T09:51:04.4299912Z } 2026-02-21T09:51:04.4299956Z 2026-02-21T09:51:04.4299989Z {-# 2026-02-21T09:51:04.4300072Z external_resources: { 2026-02-21T09:51:04.4300173Z mlir_reproducer: { 2026-02-21T09:51:04.4301160Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, 
convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:51:04.4302149Z disable_threading: false, 2026-02-21T09:51:04.4302258Z verify_each: true 2026-02-21T09:51:04.4302348Z } 2026-02-21T09:51:04.4302425Z } 2026-02-21T09:51:04.4302494Z #-} 2026-02-21T09:51:04.4302775Z /tmp/torchinductor_root/it/cit3ulfy5jg54neqzjzy7mosayhvrimhhw6wmkdny764b2zi2n24.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:51:04.4303468Z /tmp/torchinductor_root/it/cit3ulfy5jg54neqzjzy7mosayhvrimhhw6wmkdny764b2zi2n24.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:51:04.4304018Z [395s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:51:04.4304745Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:51:04.4305416Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:51:04.4305583Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:51:05.5926167Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:51:05.5943881Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:51:05.5944455Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:51:05.5944823Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T09:51:05.5945170Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:51:05.5945467Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:05.5945751Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:05.5945965Z #smem = #ttg.shared_memory 2026-02-21T09:51:05.5946246Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:51:05.5946803Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:05.5947359Z %cst = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:51:05.5947552Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:51:05.5947689Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:05.5947827Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:51:05.5948004Z %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5948184Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:05.5948320Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:05.5948453Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:51:05.5948590Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:51:05.5948725Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:51:05.5948856Z 
%c510_i32 = arith.constant 510 : i32 2026-02-21T09:51:05.5948986Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:51:05.5949115Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:51:05.5949382Z %cst_1 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5949727Z %cst_2 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5950030Z %cst_3 = arith.constant dense<0> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5950330Z %cst_4 = arith.constant dense<8192> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5950684Z %cst_5 = arith.constant dense<0> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5950940Z %cst_6 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.5951176Z %cst_7 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5951411Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:05.5951596Z %cst_9 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:05.5951834Z %cst_10 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5952110Z %cst_11 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5952468Z %cst_12 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5952702Z %cst_13 = arith.constant dense<0> : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.5952886Z %cst_14 = arith.constant dense<8192> : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.5953071Z %cst_15 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.5953255Z %cst_16 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.5953440Z %cst_17 = 
arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.5953602Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:05.5953726Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:51:05.5953874Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:51:05.5954096Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.5954402Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.5954746Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5955089Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.5955382Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5955648Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.5955913Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5956273Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5956746Z %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5957305Z %12 = arith.extsi %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5957774Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = 
#blocked}>}>> 2026-02-21T09:51:05.5958230Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:51:05.5958670Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:05.5958948Z %16 = arith.cmpi eq, %15, %cst_8 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:05.5959163Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:51:05.5959383Z %18 = arith.cmpi eq, %15, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:05.5959604Z %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:51:05.5959834Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:05.5960135Z %21 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.5960506Z %22 = arith.extsi %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.5960760Z %23 = arith.subi %2, %0 : i32 2026-02-21T09:51:05.5960883Z %24 = arith.remsi %23, %c3_i32 : i32 2026-02-21T09:51:05.5961030Z %25 = arith.subi %23, %24 : i32 2026-02-21T09:51:05.5961142Z %26 = arith.addi %0, %25 : i32 2026-02-21T09:51:05.5961304Z %27 = arith.addi %7, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5961623Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.5961892Z %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5962154Z %30 = arith.addi %11, %cst_1 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = 
#blocked}>}>> 2026-02-21T09:51:05.5962551Z %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5962973Z %32 = arith.muli %31, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5963276Z %33 = tt.broadcast %32 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5963580Z %34 = arith.cmpi sge, %31, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5963819Z %35 = arith.cmpi slt, %31, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5964041Z %36 = arith.andi %34, %35 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5964332Z %37 = tt.broadcast %36 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5964590Z scf.for %arg3 = %0 to %26 step %c3_i32 : i32 { 2026-02-21T09:51:05.5964729Z %38 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:51:05.5964875Z %39 = arith.muli %38, %c2_i32 : i32 2026-02-21T09:51:05.5964991Z %40 = arith.subi %c64_i32, %39 : i32 2026-02-21T09:51:05.5965113Z %41 = arith.minsi %40, %c2_i32 : i32 2026-02-21T09:51:05.5965234Z %42 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:51:05.5965356Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:51:05.5965467Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:51:05.5965579Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:51:05.5965693Z %46 = arith.muli %44, %c128_i32 : i32 2026-02-21T09:51:05.5965808Z %47 = arith.muli %45, %c128_i32 : i32 2026-02-21T09:51:05.5965975Z %48 = tt.splat %47 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.5966194Z %49 = arith.addi %48, %3 : tensor<128xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.5966475Z %50 = tt.expand_dims %49 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.5966729Z %51 = arith.muli %50, %cst_6 : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.5966922Z %52 = tt.broadcast %51 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5967096Z %53 = arith.extsi %46 : i32 to i64 2026-02-21T09:51:05.5967298Z %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5967613Z %55 = arith.addi %54, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5968000Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5968432Z %57 = tt.broadcast %56 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5968741Z %58 = arith.cmpi sge, %56, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5969003Z %59 = arith.cmpi slt, %56, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5969234Z %60 = arith.andi %58, %59 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5969542Z %61 = tt.broadcast %60 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5969887Z %62 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:51:05.5970104Z %253 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:05.5970281Z %254 = tt.splat %253 : i32 -> 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5970532Z %255 = arith.addi %254, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5970804Z %256 = tt.expand_dims %255 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.5971084Z %257 = tt.broadcast %256 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5971279Z %258 = arith.addi %52, %257 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5971486Z %259 = tt.addptr %8, %258 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5971696Z %260 = tt.load %259 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.5971930Z %261 = ttg.local_alloc %260 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.5972266Z %262 = ttg.local_load %261 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.5972704Z %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.5972992Z %264 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:05.5973205Z %265 = tt.splat %264 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5973504Z %266 = arith.addi %265, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5973894Z %267 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5974245Z %268 = arith.muli %267, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5974561Z %269 = 
tt.broadcast %268 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5974870Z %270 = arith.addi %269, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5975185Z %271 = tt.addptr %9, %270 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5975504Z %272 = arith.cmpi sge, %267, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5975774Z %273 = arith.cmpi slt, %267, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5976009Z %274 = arith.andi %272, %273 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5976312Z %275 = tt.broadcast %274 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5976612Z %276 = arith.andi %275, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5976858Z %277 = tt.load %271, %276, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5977124Z %278 = arith.shli %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5977359Z %279 = arith.shrsi %278, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5977603Z %280 = arith.shrsi %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5977891Z %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.5978230Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.5978539Z %283 = tt.broadcast %281 : 
tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5978783Z %284 = arith.select %17, %283, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5979025Z %285 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5979257Z %286 = arith.select %19, %285, %284 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5979490Z %287 = tt.reshape %286 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.5979718Z %288 = arith.sitofp %287 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.5979976Z %289 = ttg.local_alloc %288 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.5980305Z %290 = ttg.local_load %289 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.5980799Z %291 = tt.dot %263, %290, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.5981148Z %292 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:51:05.5981276Z %293 = arith.muli %292, %c2_i32 : i32 2026-02-21T09:51:05.5981446Z %294 = tt.splat %293 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5981670Z %295 = arith.addi %294, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5981946Z %296 = tt.expand_dims %295 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.5982224Z %297 = tt.broadcast %296 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5982429Z %298 = arith.addi %52, %297 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5982632Z %299 = 
tt.addptr %8, %298 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.5982841Z %300 = tt.load %299 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.5983064Z %301 = ttg.local_alloc %300 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.5983399Z %302 = ttg.local_load %301 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.5983819Z %303 = arith.extf %302 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.5984102Z %304 = arith.extsi %292 : i32 to i64 2026-02-21T09:51:05.5984312Z %305 = tt.splat %304 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5984609Z %306 = arith.addi %305, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.5985009Z %307 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5985365Z %308 = arith.muli %307, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5985678Z %309 = tt.broadcast %308 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5985984Z %310 = arith.addi %309, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5992878Z %311 = tt.addptr %9, %310 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5993209Z %312 = arith.cmpi sge, %307, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:05.5993462Z %313 = arith.cmpi slt, %307, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5993697Z %314 = arith.andi %312, %313 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5993996Z %315 = tt.broadcast %314 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5994296Z %316 = arith.andi %315, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5994539Z %317 = tt.load %311, %316, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5994787Z %318 = arith.shli %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5995046Z %319 = arith.shrsi %318, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5995279Z %320 = arith.shrsi %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.5995575Z %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.5995912Z %322 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.5996201Z %323 = tt.broadcast %321 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5996449Z %324 = arith.select %17, %323, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5996690Z %325 = tt.broadcast %322 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5996929Z %326 = arith.select %19, %325, %324 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.5997159Z %327 = tt.reshape %326 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.5997388Z %328 = 
arith.sitofp %327 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.5997644Z %329 = ttg.local_alloc %328 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.5997987Z %330 = ttg.local_load %329 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.5998461Z %331 = tt.dot %303, %330, %291, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.5998809Z %332 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:05.5998935Z %333 = arith.muli %332, %c2_i32 : i32 2026-02-21T09:51:05.5999109Z %334 = tt.splat %333 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5999347Z %335 = arith.addi %334, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.5999625Z %336 = tt.expand_dims %335 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.5999904Z %337 = tt.broadcast %336 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6000100Z %338 = arith.addi %52, %337 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6000304Z %339 = tt.addptr %8, %338 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6000515Z %340 = tt.load %339 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6000755Z %341 = ttg.local_alloc %340 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6001085Z %342 = ttg.local_load %341 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6001492Z %343 = arith.extf %342 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6001776Z %344 = arith.extsi %332 : i32 to i64 2026-02-21T09:51:05.6001985Z %345 = tt.splat %344 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6002284Z %346 = arith.addi %345, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6002717Z %347 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6003094Z %348 = arith.muli %347, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6003409Z %349 = tt.broadcast %348 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6003720Z %350 = arith.addi %349, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6004038Z %351 = tt.addptr %9, %350 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6004361Z %352 = arith.cmpi sge, %347, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6004605Z %353 = arith.cmpi slt, %347, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6004847Z %354 = arith.andi %352, %353 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6005152Z %355 = tt.broadcast %354 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6005456Z %356 = arith.andi %355, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6005703Z %357 = tt.load %351, %356, 
%cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6005964Z %358 = arith.shli %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6006199Z %359 = arith.shrsi %358, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6006438Z %360 = arith.shrsi %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6006728Z %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6007066Z %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6007370Z %363 = tt.broadcast %361 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6007612Z %364 = arith.select %17, %363, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6007852Z %365 = tt.broadcast %362 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6008087Z %366 = arith.select %19, %365, %364 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6008322Z %367 = tt.reshape %366 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6008546Z %368 = arith.sitofp %367 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6008818Z %369 = ttg.local_alloc %368 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6009146Z %370 = ttg.local_load %369 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6009623Z %371 = tt.dot %343, %370, %331, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, 
parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6009977Z scf.yield %371 : tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6010113Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:51:05.6010255Z %63 = arith.addi %52, %29 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6010454Z %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6010655Z %65 = tt.load %64 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6010894Z %66 = ttg.local_alloc %65 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6011215Z %67 = ttg.local_load %66 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6011621Z %68 = arith.extf %67 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6011951Z %69 = arith.addi %33, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6012256Z %70 = tt.addptr %9, %69 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6012558Z %71 = arith.andi %37, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6012794Z %72 = tt.load %70, %71, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6013032Z %73 = arith.shli %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6013266Z %74 = arith.shrsi %73, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6013494Z %75 = arith.shrsi %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6013776Z %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6014121Z %77 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6014399Z %78 = tt.broadcast %76 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6014635Z %79 = arith.select %17, %78, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6014869Z %80 = tt.broadcast %77 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6015097Z %81 = arith.select %19, %80, %79 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6015341Z %82 = tt.reshape %81 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6015559Z %83 = arith.sitofp %82 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6015808Z %84 = ttg.local_alloc %83 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6016128Z %85 = ttg.local_load %84 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6016604Z %86 = tt.dot %68, %85, %62, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6016987Z %87 = arith.truncf %86 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:51:05.6017159Z %88 = arith.extsi %47 : i32 to i64 2026-02-21T09:51:05.6017324Z %89 = tt.splat %88 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6017533Z %90 = arith.addi %89, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6017799Z %91 = tt.expand_dims %90 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = 
#mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6018040Z %92 = arith.muli %91, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6018218Z %93 = tt.broadcast %92 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6018423Z %94 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6018627Z %95 = arith.addi %94, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6018905Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6019168Z %97 = tt.broadcast %96 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6019346Z %98 = arith.addi %93, %97 : tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6019534Z %99 = tt.addptr %20, %98 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6019734Z %100 = arith.cmpi sge, %91, %cst_16 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6019902Z %101 = arith.cmpi slt, %91, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6020059Z %102 = arith.andi %100, %101 : tensor<128x1xi1, #mma> 2026-02-21T09:51:05.6020230Z %103 = tt.broadcast %102 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6020414Z %104 = arith.cmpi sge, %96, %cst_13 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6020576Z %105 = arith.cmpi slt, %96, %cst_14 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6020730Z %106 = arith.andi %104, %105 : tensor<1x128xi1, #mma> 2026-02-21T09:51:05.6020900Z %107 = tt.broadcast %106 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6021078Z %108 = arith.andi %103, %107 : tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6021239Z tt.store %99, %87, %108 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:05.6021382Z %109 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:51:05.6021520Z %110 = arith.divsi %109, %c256_i32 : i32 2026-02-21T09:51:05.6021639Z %111 = 
arith.muli %110, %c2_i32 : i32 2026-02-21T09:51:05.6021755Z %112 = arith.subi %c64_i32, %111 : i32 2026-02-21T09:51:05.6021870Z %113 = arith.minsi %112, %c2_i32 : i32 2026-02-21T09:51:05.6021988Z %114 = arith.remsi %109, %c256_i32 : i32 2026-02-21T09:51:05.6022102Z %115 = arith.remsi %114, %113 : i32 2026-02-21T09:51:05.6022216Z %116 = arith.addi %111, %115 : i32 2026-02-21T09:51:05.6022328Z %117 = arith.divsi %114, %113 : i32 2026-02-21T09:51:05.6022444Z %118 = arith.muli %116, %c128_i32 : i32 2026-02-21T09:51:05.6022580Z %119 = arith.muli %117, %c128_i32 : i32 2026-02-21T09:51:05.6022749Z %120 = tt.splat %119 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.6022975Z %121 = arith.addi %120, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.6023257Z %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.6023514Z %123 = arith.muli %122, %cst_6 : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.6023709Z %124 = tt.broadcast %123 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6023882Z %125 = arith.extsi %118 : i32 to i64 2026-02-21T09:51:05.6024103Z %126 = tt.splat %125 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6024400Z %127 = arith.addi %126, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6024800Z %128 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6025239Z %129 = tt.broadcast %128 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6025560Z %130 = arith.cmpi 
sge, %128, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6025812Z %131 = arith.cmpi slt, %128, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6026053Z %132 = arith.andi %130, %131 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6026369Z %133 = tt.broadcast %132 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6026713Z %134 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:51:05.6026926Z %253 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:05.6027099Z %254 = tt.splat %253 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6027325Z %255 = arith.addi %254, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6027600Z %256 = tt.expand_dims %255 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6027881Z %257 = tt.broadcast %256 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6028076Z %258 = arith.addi %124, %257 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6028282Z %259 = tt.addptr %8, %258 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6028492Z %260 = tt.load %259 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6028715Z %261 = ttg.local_alloc %260 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6029050Z %262 = ttg.local_load %261 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6029471Z %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 
0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6029754Z %264 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:05.6029965Z %265 = tt.splat %264 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6030258Z %266 = arith.addi %265, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6030662Z %267 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6031015Z %268 = arith.muli %267, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6031325Z %269 = tt.broadcast %268 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6031634Z %270 = arith.addi %269, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6031962Z %271 = tt.addptr %9, %270 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6032280Z %272 = arith.cmpi sge, %267, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6032530Z %273 = arith.cmpi slt, %267, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6032763Z %274 = arith.andi %272, %273 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6033062Z %275 = tt.broadcast %274 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6033362Z %276 = arith.andi %275, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6033608Z %277 = tt.load %271, %276, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:51:05.6033857Z %278 = arith.shli %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6034111Z %279 = arith.shrsi %278, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6034347Z %280 = arith.shrsi %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6034636Z %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6034977Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6035267Z %283 = tt.broadcast %281 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6035509Z %284 = arith.select %17, %283, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6035752Z %285 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6035990Z %286 = arith.select %19, %285, %284 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6036220Z %287 = tt.reshape %286 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6036448Z %288 = arith.sitofp %287 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6036702Z %289 = ttg.local_alloc %288 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6037051Z %290 = ttg.local_load %289 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6037532Z %291 = tt.dot %263, %290, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 
2026-02-21T09:51:05.6037882Z %292 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:51:05.6038012Z %293 = arith.muli %292, %c2_i32 : i32 2026-02-21T09:51:05.6038186Z %294 = tt.splat %293 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6038434Z %295 = arith.addi %294, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6038718Z %296 = tt.expand_dims %295 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6038997Z %297 = tt.broadcast %296 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6039203Z %298 = arith.addi %124, %297 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6039408Z %299 = tt.addptr %8, %298 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6039629Z %300 = tt.load %299 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6039892Z %301 = ttg.local_alloc %300 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6040229Z %302 = ttg.local_load %301 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6040643Z %303 = arith.extf %302 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6040932Z %304 = arith.extsi %292 : i32 to i64 2026-02-21T09:51:05.6041150Z %305 = tt.splat %304 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6041453Z %306 = arith.addi %305, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6041851Z %307 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6042231Z %308 = arith.muli %307, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6042550Z %309 = tt.broadcast %308 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6042900Z %310 = arith.addi %309, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6043222Z %311 = tt.addptr %9, %310 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6043545Z %312 = arith.cmpi sge, %307, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6043800Z %313 = arith.cmpi slt, %307, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6044043Z %314 = arith.andi %312, %313 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6044350Z %315 = tt.broadcast %314 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6044660Z %316 = arith.andi %315, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6044909Z %317 = tt.load %311, %316, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6045183Z %318 = arith.shli %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6045425Z %319 = arith.shrsi %318, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6045663Z %320 = arith.shrsi %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6045959Z %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6046305Z %322 
= tt.expand_dims %320 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6046618Z %323 = tt.broadcast %321 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6046871Z %324 = arith.select %17, %323, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6047113Z %325 = tt.broadcast %322 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6047359Z %326 = arith.select %19, %325, %324 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6047600Z %327 = tt.reshape %326 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6047830Z %328 = arith.sitofp %327 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6048114Z %329 = ttg.local_alloc %328 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6048447Z %330 = ttg.local_load %329 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6048930Z %331 = tt.dot %303, %330, %291, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6049287Z %332 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:05.6049416Z %333 = arith.muli %332, %c2_i32 : i32 2026-02-21T09:51:05.6049594Z %334 = tt.splat %333 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6049819Z %335 = arith.addi %334, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6050102Z %336 = tt.expand_dims %335 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6050403Z %337 = tt.broadcast %336 : 
tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6050603Z %338 = arith.addi %124, %337 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6050810Z %339 = tt.addptr %8, %338 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6051020Z %340 = tt.load %339 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6051251Z %341 = ttg.local_alloc %340 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6051588Z %342 = ttg.local_load %341 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6052000Z %343 = arith.extf %342 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6052293Z %344 = arith.extsi %332 : i32 to i64 2026-02-21T09:51:05.6052506Z %345 = tt.splat %344 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6052816Z %346 = arith.addi %345, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6053211Z %347 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6053585Z %348 = arith.muli %347, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6053905Z %349 = tt.broadcast %348 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6054226Z %350 = arith.addi %349, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6054542Z %351 = tt.addptr %9, %350 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, 
tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6054882Z %352 = arith.cmpi sge, %347, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6055129Z %353 = arith.cmpi slt, %347, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6055373Z %354 = arith.andi %352, %353 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6055685Z %355 = tt.broadcast %354 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6055990Z %356 = arith.andi %355, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6056260Z %357 = tt.load %351, %356, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6056512Z %358 = arith.shli %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6056756Z %359 = arith.shrsi %358, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6057002Z %360 = arith.shrsi %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6057344Z %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6057692Z %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6057988Z %363 = tt.broadcast %361 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6058235Z %364 = arith.select %17, %363, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6058500Z %365 = tt.broadcast %362 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6058744Z %366 = arith.select %19, %365, %364 : tensor<2x2x128xi1, 
#blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6058982Z %367 = tt.reshape %366 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6059216Z %368 = arith.sitofp %367 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6059475Z %369 = ttg.local_alloc %368 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6059808Z %370 = ttg.local_load %369 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6060284Z %371 = tt.dot %343, %370, %331, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6060641Z scf.yield %371 : tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6060785Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:51:05.6060933Z %135 = arith.addi %124, %29 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6061145Z %136 = tt.addptr %8, %135 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6061372Z %137 = tt.load %136 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6061604Z %138 = ttg.local_alloc %137 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6061949Z %139 = ttg.local_load %138 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6062367Z %140 = arith.extf %139 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6062710Z %141 = arith.addi %33, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6063046Z %142 = tt.addptr %9, %141 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, 
parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6063360Z %143 = arith.andi %37, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6063616Z %144 = tt.load %142, %143, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6063865Z %145 = arith.shli %144, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6064107Z %146 = arith.shrsi %145, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6064368Z %147 = arith.shrsi %144, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6064662Z %148 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6065007Z %149 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6065294Z %150 = tt.broadcast %148 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6065541Z %151 = arith.select %17, %150, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6065792Z %152 = tt.broadcast %149 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6066030Z %153 = arith.select %19, %152, %151 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6066269Z %154 = tt.reshape %153 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6066515Z %155 = arith.sitofp %154 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6066779Z %156 = ttg.local_alloc %155 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6067113Z %157 = ttg.local_load %156 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6067583Z %158 = tt.dot %140, %157, %134, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6067982Z %159 = arith.truncf %158 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:51:05.6068161Z %160 = arith.extsi %119 : i32 to i64 2026-02-21T09:51:05.6068334Z %161 = tt.splat %160 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6068557Z %162 = arith.addi %161, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6068826Z %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6069076Z %164 = arith.muli %163, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6069263Z %165 = tt.broadcast %164 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6069481Z %166 = tt.splat %125 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6069712Z %167 = arith.addi %166, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6069980Z %168 = tt.expand_dims %167 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6070251Z %169 = tt.broadcast %168 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6070439Z %170 = arith.addi %165, %169 : tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6070640Z %171 = tt.addptr %20, %170 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6070877Z %172 = arith.cmpi sge, %163, %cst_16 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6071048Z %173 = arith.cmpi slt, %163, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6071214Z %174 = arith.andi %172, 
%173 : tensor<128x1xi1, #mma> 2026-02-21T09:51:05.6071394Z %175 = tt.broadcast %174 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6071589Z %176 = arith.cmpi sge, %168, %cst_13 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6071759Z %177 = arith.cmpi slt, %168, %cst_14 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6071925Z %178 = arith.andi %176, %177 : tensor<1x128xi1, #mma> 2026-02-21T09:51:05.6072105Z %179 = tt.broadcast %178 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6072305Z %180 = arith.andi %175, %179 : tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6072477Z tt.store %171, %159, %180 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:05.6072629Z %181 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:51:05.6072762Z %182 = arith.divsi %181, %c256_i32 : i32 2026-02-21T09:51:05.6072888Z %183 = arith.muli %182, %c2_i32 : i32 2026-02-21T09:51:05.6073013Z %184 = arith.subi %c64_i32, %183 : i32 2026-02-21T09:51:05.6073138Z %185 = arith.minsi %184, %c2_i32 : i32 2026-02-21T09:51:05.6073262Z %186 = arith.remsi %181, %c256_i32 : i32 2026-02-21T09:51:05.6073389Z %187 = arith.remsi %186, %185 : i32 2026-02-21T09:51:05.6073508Z %188 = arith.addi %183, %187 : i32 2026-02-21T09:51:05.6073630Z %189 = arith.divsi %186, %185 : i32 2026-02-21T09:51:05.6073750Z %190 = arith.muli %188, %c128_i32 : i32 2026-02-21T09:51:05.6073875Z %191 = arith.muli %189, %c128_i32 : i32 2026-02-21T09:51:05.6074053Z %192 = tt.splat %191 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.6074307Z %193 = arith.addi %192, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.6074600Z %194 = tt.expand_dims %193 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.6074858Z %195 = arith.muli %194, %cst_6 : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.6075062Z %196 = tt.broadcast %195 : tensor<128x1xi32, 
#blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6075241Z %197 = arith.extsi %190 : i32 to i64 2026-02-21T09:51:05.6075454Z %198 = tt.splat %197 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6075763Z %199 = arith.addi %198, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6076163Z %200 = tt.expand_dims %199 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6076608Z %201 = tt.broadcast %200 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6076938Z %202 = arith.cmpi sge, %200, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6077205Z %203 = arith.cmpi slt, %200, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6077447Z %204 = arith.andi %202, %203 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6077752Z %205 = tt.broadcast %204 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6078096Z %206 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:51:05.6078314Z %253 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:05.6078487Z %254 = tt.splat %253 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6078732Z %255 = arith.addi %254, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6079007Z %256 = tt.expand_dims %255 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 
2026-02-21T09:51:05.6079288Z %257 = tt.broadcast %256 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6079487Z %258 = arith.addi %196, %257 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6079691Z %259 = tt.addptr %8, %258 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6079902Z %260 = tt.load %259 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6080142Z %261 = ttg.local_alloc %260 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6080475Z %262 = ttg.local_load %261 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6080888Z %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6081172Z %264 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:05.6081384Z %265 = tt.splat %264 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6081680Z %266 = arith.addi %265, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6082089Z %267 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6082444Z %268 = arith.muli %267, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6082818Z %269 = tt.broadcast %268 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6083130Z %270 = arith.addi %269, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6083447Z %271 = tt.addptr %9, %270 : tensor<2x128x!tt.ptr, 
#ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6083766Z %272 = arith.cmpi sge, %267, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6084014Z %273 = arith.cmpi slt, %267, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6084252Z %274 = arith.andi %272, %273 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6084555Z %275 = tt.broadcast %274 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6084859Z %276 = arith.andi %275, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6085105Z %277 = tt.load %271, %276, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6085377Z %278 = arith.shli %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6085610Z %279 = arith.shrsi %278, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6085848Z %280 = arith.shrsi %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6086141Z %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6086479Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6086792Z %283 = tt.broadcast %281 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6087036Z %284 = arith.select %17, %283, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6087276Z %285 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6087520Z %286 = 
arith.select %19, %285, %284 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6087754Z %287 = tt.reshape %286 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6087982Z %288 = arith.sitofp %287 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6088256Z %289 = ttg.local_alloc %288 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6088587Z %290 = ttg.local_load %289 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6089066Z %291 = tt.dot %263, %290, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6089416Z %292 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:51:05.6089542Z %293 = arith.muli %292, %c2_i32 : i32 2026-02-21T09:51:05.6089716Z %294 = tt.splat %293 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6089939Z %295 = arith.addi %294, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6090240Z %296 = tt.expand_dims %295 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6090517Z %297 = tt.broadcast %296 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6090717Z %298 = arith.addi %196, %297 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6090921Z %299 = tt.addptr %8, %298 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6091129Z %300 = tt.load %299 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6091354Z %301 = ttg.local_alloc %300 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6091683Z %302 = 
ttg.local_load %301 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6092101Z %303 = arith.extf %302 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6092386Z %304 = arith.extsi %292 : i32 to i64 2026-02-21T09:51:05.6092595Z %305 = tt.splat %304 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6092897Z %306 = arith.addi %305, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6093303Z %307 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6093658Z %308 = arith.muli %307, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6093971Z %309 = tt.broadcast %308 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6094283Z %310 = arith.addi %309, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6094600Z %311 = tt.addptr %9, %310 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6094936Z %312 = arith.cmpi sge, %307, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6095181Z %313 = arith.cmpi slt, %307, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6095418Z %314 = arith.andi %312, %313 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6095718Z %315 = tt.broadcast %314 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6096021Z %316 = arith.andi %315, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6096286Z %317 = tt.load %311, %316, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6096532Z %318 = arith.shli %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6096772Z %319 = arith.shrsi %318, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6097007Z %320 = arith.shrsi %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6097301Z %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6097644Z %322 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6097934Z %323 = tt.broadcast %321 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6098199Z %324 = arith.select %17, %323, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6098440Z %325 = tt.broadcast %322 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6098675Z %326 = arith.select %19, %325, %324 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6098910Z %327 = tt.reshape %326 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6099137Z %328 = arith.sitofp %327 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6099397Z %329 = ttg.local_alloc %328 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6099728Z %330 = ttg.local_load %329 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6100207Z %331 = tt.dot %303, %330, %291, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6100557Z %332 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:05.6100680Z %333 = arith.muli %332, %c2_i32 : i32 2026-02-21T09:51:05.6100855Z %334 = tt.splat %333 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6101080Z %335 = arith.addi %334, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6101385Z %336 = tt.expand_dims %335 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6101665Z %337 = tt.broadcast %336 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6101862Z %338 = arith.addi %196, %337 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6102069Z %339 = tt.addptr %8, %338 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6102277Z %340 = tt.load %339 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6102519Z %341 = ttg.local_alloc %340 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6102855Z %342 = ttg.local_load %341 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6103266Z %343 = arith.extf %342 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6103553Z %344 = arith.extsi %332 : i32 to i64 2026-02-21T09:51:05.6103766Z %345 = tt.splat %344 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6104081Z 
%346 = arith.addi %345, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6104474Z %347 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6104839Z %348 = arith.muli %347, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6105149Z %349 = tt.broadcast %348 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6105462Z %350 = arith.addi %349, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6105776Z %351 = tt.addptr %9, %350 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6106100Z %352 = arith.cmpi sge, %347, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6106365Z %353 = arith.cmpi slt, %347, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6106602Z %354 = arith.andi %352, %353 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6106905Z %355 = tt.broadcast %354 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6107208Z %356 = arith.andi %355, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6107456Z %357 = tt.load %351, %356, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6107706Z %358 = arith.shli %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6107941Z %359 = arith.shrsi %358, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:05.6108180Z %360 = arith.shrsi %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6108469Z %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6108810Z %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6109119Z %363 = tt.broadcast %361 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6109360Z %364 = arith.select %17, %363, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6109603Z %365 = tt.broadcast %362 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6109840Z %366 = arith.select %19, %365, %364 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6110072Z %367 = tt.reshape %366 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6110302Z %368 = arith.sitofp %367 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6110575Z %369 = ttg.local_alloc %368 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6110905Z %370 = ttg.local_load %369 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6111379Z %371 = tt.dot %343, %370, %331, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6111730Z scf.yield %371 : tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6111867Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:51:05.6112027Z %207 = arith.addi %196, %29 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6112234Z %208 
= tt.addptr %8, %207 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6112445Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6112669Z %210 = ttg.local_alloc %209 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6113002Z %211 = ttg.local_load %210 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6113413Z %212 = arith.extf %211 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6113750Z %213 = arith.addi %33, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6114079Z %214 = tt.addptr %9, %213 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6114388Z %215 = arith.andi %37, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6114635Z %216 = tt.load %214, %215, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6114880Z %217 = arith.shli %216, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6115120Z %218 = arith.shrsi %217, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6115358Z %219 = arith.shrsi %216, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6115647Z %220 = tt.expand_dims %218 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6115986Z %221 = tt.expand_dims %219 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6116274Z %222 = tt.broadcast %220 : tensor<2x1x128xi8, 
#blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6116516Z %223 = arith.select %17, %222, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6116758Z %224 = tt.broadcast %221 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6116995Z %225 = arith.select %19, %224, %223 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6117247Z %226 = tt.reshape %225 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6117477Z %227 = arith.sitofp %226 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6117733Z %228 = ttg.local_alloc %227 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6118063Z %229 = ttg.local_load %228 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6118529Z %230 = tt.dot %212, %229, %206, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6118939Z %231 = arith.truncf %230 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:51:05.6119118Z %232 = arith.extsi %191 : i32 to i64 2026-02-21T09:51:05.6119283Z %233 = tt.splat %232 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6119497Z %234 = arith.addi %233, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6119764Z %235 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6120027Z %236 = arith.muli %235, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6120213Z %237 = tt.broadcast %236 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6120423Z %238 = 
tt.splat %197 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6120639Z %239 = arith.addi %238, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6120903Z %240 = tt.expand_dims %239 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6121170Z %241 = tt.broadcast %240 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6121357Z %242 = arith.addi %237, %241 : tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6121549Z %243 = tt.addptr %20, %242 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6121755Z %244 = arith.cmpi sge, %235, %cst_16 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6121924Z %245 = arith.cmpi slt, %235, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6122102Z %246 = arith.andi %244, %245 : tensor<128x1xi1, #mma> 2026-02-21T09:51:05.6122280Z %247 = tt.broadcast %246 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6122470Z %248 = arith.cmpi sge, %240, %cst_13 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6122684Z %249 = arith.cmpi slt, %240, %cst_14 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6122842Z %250 = arith.andi %248, %249 : tensor<1x128xi1, #mma> 2026-02-21T09:51:05.6123018Z %251 = tt.broadcast %250 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6123201Z %252 = arith.andi %247, %251 : tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6123372Z tt.store %243, %231, %252 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:05.6123521Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:05.6123647Z scf.for %arg3 = %26 to %2 step %c1_i32 : i32 { 2026-02-21T09:51:05.6123787Z %38 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:51:05.6123910Z %39 = arith.muli %38, %c2_i32 : i32 2026-02-21T09:51:05.6124031Z %40 = arith.subi %c64_i32, %39 : i32 2026-02-21T09:51:05.6124148Z %41 = arith.minsi %40, %c2_i32 : i32 
2026-02-21T09:51:05.6124269Z %42 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:51:05.6124388Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:51:05.6124502Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:51:05.6124616Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:51:05.6124748Z %46 = arith.muli %44, %c128_i32 : i32 2026-02-21T09:51:05.6124867Z %47 = arith.muli %45, %c128_i32 : i32 2026-02-21T09:51:05.6125035Z %48 = tt.splat %47 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.6125259Z %49 = arith.addi %48, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:05.6125536Z %50 = tt.expand_dims %49 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.6125790Z %51 = arith.muli %50, %cst_6 : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:05.6125985Z %52 = tt.broadcast %51 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6126183Z %53 = arith.extsi %46 : i32 to i64 2026-02-21T09:51:05.6126390Z %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6126688Z %55 = arith.addi %54, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6127079Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6127529Z %57 = tt.broadcast %56 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6127842Z %58 = arith.cmpi sge, %56, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6128091Z %59 = arith.cmpi slt, %56, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:05.6128323Z %60 = arith.andi %58, %59 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6128622Z %61 = tt.broadcast %60 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6128959Z %62 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:51:05.6129174Z %109 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:05.6129352Z %110 = tt.splat %109 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6129584Z %111 = arith.addi %110, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6129883Z %112 = tt.expand_dims %111 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6130168Z %113 = tt.broadcast %112 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6130365Z %114 = arith.addi %52, %113 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6130573Z %115 = tt.addptr %8, %114 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6130787Z %116 = tt.load %115 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6131015Z %117 = ttg.local_alloc %116 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6131354Z %118 = ttg.local_load %117 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6131769Z %119 = arith.extf %118 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6132062Z %120 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:05.6139556Z %121 = tt.splat %120 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = 
#ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6139888Z %122 = arith.addi %121, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6140330Z %123 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6140691Z %124 = arith.muli %123, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6141014Z %125 = tt.broadcast %124 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6141324Z %126 = arith.addi %125, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6141674Z %127 = tt.addptr %9, %126 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6141996Z %128 = arith.cmpi sge, %123, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6142242Z %129 = arith.cmpi slt, %123, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6142481Z %130 = arith.andi %128, %129 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6142784Z %131 = tt.broadcast %130 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6143102Z %132 = arith.andi %131, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6143350Z %133 = tt.load %127, %132, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6143600Z %134 = arith.shli %133, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6143839Z %135 = arith.shrsi %134, %cst_7 : 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6144078Z %136 = arith.shrsi %133, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6144367Z %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6144706Z %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6144994Z %139 = tt.broadcast %137 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6145263Z %140 = arith.select %17, %139, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6145507Z %141 = tt.broadcast %138 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6145743Z %142 = arith.select %19, %141, %140 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6145979Z %143 = tt.reshape %142 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6146206Z %144 = arith.sitofp %143 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6146462Z %145 = ttg.local_alloc %144 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6146792Z %146 = ttg.local_load %145 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6147278Z %147 = tt.dot %119, %146, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6147632Z %148 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:51:05.6147757Z %149 = arith.muli %148, %c2_i32 : i32 2026-02-21T09:51:05.6147929Z %150 = tt.splat %149 : i32 -> 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6148169Z %151 = arith.addi %150, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6148442Z %152 = tt.expand_dims %151 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6148719Z %153 = tt.broadcast %152 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6148916Z %154 = arith.addi %52, %153 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6149123Z %155 = tt.addptr %8, %154 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6149352Z %156 = tt.load %155 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6149576Z %157 = ttg.local_alloc %156 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6149907Z %158 = ttg.local_load %157 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6150314Z %159 = arith.extf %158 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6150600Z %160 = arith.extsi %148 : i32 to i64 2026-02-21T09:51:05.6150812Z %161 = tt.splat %160 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6151124Z %162 = arith.addi %161, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6151512Z %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6151866Z %164 = arith.muli %163, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6152176Z %165 = 
tt.broadcast %164 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6152485Z %166 = arith.addi %165, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6152798Z %167 = tt.addptr %9, %166 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6153134Z %168 = arith.cmpi sge, %163, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6153378Z %169 = arith.cmpi slt, %163, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6153615Z %170 = arith.andi %168, %169 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6153920Z %171 = tt.broadcast %170 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6154221Z %172 = arith.andi %171, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6154464Z %173 = tt.load %167, %172, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6154712Z %174 = arith.shli %173, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6154947Z %175 = arith.shrsi %174, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6155184Z %176 = arith.shrsi %173, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6155476Z %177 = tt.expand_dims %175 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6155814Z %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6156117Z %179 = tt.broadcast %177 : 
tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6156358Z %180 = arith.select %17, %179, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6156598Z %181 = tt.broadcast %178 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6156834Z %182 = arith.select %19, %181, %180 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6157067Z %183 = tt.reshape %182 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6157317Z %184 = arith.sitofp %183 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6157572Z %185 = ttg.local_alloc %184 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6157902Z %186 = ttg.local_load %185 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6158376Z %187 = tt.dot %159, %186, %147, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6158723Z %188 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:05.6158867Z %189 = arith.muli %188, %c2_i32 : i32 2026-02-21T09:51:05.6159041Z %190 = tt.splat %189 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6159263Z %191 = arith.addi %190, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:05.6159540Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:05.6159814Z %193 = tt.broadcast %192 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6160012Z %194 = arith.addi %52, %193 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6160211Z %195 = 
tt.addptr %8, %194 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6160418Z %196 = tt.load %195 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6160645Z %197 = ttg.local_alloc %196 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6160992Z %198 = ttg.local_load %197 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6161404Z %199 = arith.extf %198 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6161686Z %200 = arith.extsi %188 : i32 to i64 2026-02-21T09:51:05.6161899Z %201 = tt.splat %200 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6162198Z %202 = arith.addi %201, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:05.6162643Z %203 = tt.expand_dims %202 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6163004Z %204 = arith.muli %203, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6163315Z %205 = tt.broadcast %204 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6163621Z %206 = arith.addi %205, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6163933Z %207 = tt.addptr %9, %206 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6164278Z %208 = arith.cmpi sge, %203, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:05.6164523Z %209 = arith.cmpi slt, %203, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6164759Z %210 = arith.andi %208, %209 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6165061Z %211 = tt.broadcast %210 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6165386Z %212 = arith.andi %211, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6165630Z %213 = tt.load %207, %212, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6165875Z %214 = arith.shli %213, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6166111Z %215 = arith.shrsi %214, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6166345Z %216 = arith.shrsi %213, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6166636Z %217 = tt.expand_dims %215 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6166998Z %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6167284Z %219 = tt.broadcast %217 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6167526Z %220 = arith.select %17, %219, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6167766Z %221 = tt.broadcast %218 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6168003Z %222 = arith.select %19, %221, %220 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6168236Z %223 = tt.reshape %222 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6168460Z %224 = 
arith.sitofp %223 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6168718Z %225 = ttg.local_alloc %224 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6169065Z %226 = ttg.local_load %225 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6169536Z %227 = tt.dot %199, %226, %187, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6169889Z scf.yield %227 : tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6170026Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:51:05.6170168Z %63 = arith.addi %52, %29 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6170363Z %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:05.6170563Z %65 = tt.load %64 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:05.6170784Z %66 = ttg.local_alloc %65 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:05.6171109Z %67 = ttg.local_load %66 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6171518Z %68 = arith.extf %67 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6171845Z %69 = arith.addi %33, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6172169Z %70 = tt.addptr %9, %69 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6172471Z %71 = arith.andi %37, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:05.6172703Z %72 = tt.load %70, %71, %cst_3 : tensor<2x128x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6172943Z %73 = arith.shli %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6173171Z %74 = arith.shrsi %73, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6173416Z %75 = arith.shrsi %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:05.6173698Z %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6174027Z %77 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:51:05.6174306Z %78 = tt.broadcast %76 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6174545Z %79 = arith.select %17, %78, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6174800Z %80 = tt.broadcast %77 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6175029Z %81 = arith.select %19, %80, %79 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:51:05.6175253Z %82 = tt.reshape %81 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:51:05.6175477Z %83 = arith.sitofp %82 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:51:05.6175724Z %84 = ttg.local_alloc %83 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:05.6176042Z %85 = ttg.local_load %84 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:05.6176503Z %86 = tt.dot %68, %85, %62, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:05.6176901Z %87 = arith.truncf %86 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:51:05.6177079Z %88 = arith.extsi %47 : i32 to i64 2026-02-21T09:51:05.6177242Z %89 = tt.splat %88 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6177448Z %90 = arith.addi %89, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:05.6177710Z %91 = tt.expand_dims %90 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6177947Z %92 = arith.muli %91, %cst_15 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6178127Z %93 = tt.broadcast %92 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6178329Z %94 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6178529Z %95 = arith.addi %94, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:05.6178791Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6179047Z %97 = tt.broadcast %96 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6179228Z %98 = arith.addi %93, %97 : tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6179417Z %99 = tt.addptr %20, %98 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:51:05.6179634Z %100 = arith.cmpi sge, %91, %cst_16 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6179803Z %101 = arith.cmpi slt, %91, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:51:05.6179959Z %102 = arith.andi %100, %101 : tensor<128x1xi1, #mma> 2026-02-21T09:51:05.6180134Z %103 = tt.broadcast %102 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6180318Z %104 = arith.cmpi sge, %96, %cst_13 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6180483Z %105 
= arith.cmpi slt, %96, %cst_14 : tensor<1x128xi64, #mma> 2026-02-21T09:51:05.6180640Z %106 = arith.andi %104, %105 : tensor<1x128xi1, #mma> 2026-02-21T09:51:05.6180811Z %107 = tt.broadcast %106 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6181016Z %108 = arith.andi %103, %107 : tensor<128x128xi1, #mma> 2026-02-21T09:51:05.6181173Z tt.store %99, %87, %108 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:05.6181317Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:05.6181421Z tt.return 2026-02-21T09:51:05.6181502Z } 2026-02-21T09:51:05.6181574Z } 2026-02-21T09:51:05.6181622Z 2026-02-21T09:51:05.6181653Z {-# 2026-02-21T09:51:05.6181734Z external_resources: { 2026-02-21T09:51:05.6181833Z mlir_reproducer: { 2026-02-21T09:51:05.6182853Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:51:05.6183842Z disable_threading: false, 2026-02-21T09:51:05.6183947Z verify_each: true 2026-02-21T09:51:05.6184037Z } 2026-02-21T09:51:05.6184107Z } 2026-02-21T09:51:05.6184176Z #-} 2026-02-21T09:51:05.6184452Z /tmp/torchinductor_root/lb/clbavg5reont7fzmlfj6vv4kl6ka6fhi6slqhyesoznsr4cqh7yc.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:51:05.6185174Z /tmp/torchinductor_root/lb/clbavg5reont7fzmlfj6vv4kl6ka6fhi6slqhyesoznsr4cqh7yc.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer 
generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:51:05.6185725Z [396s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:51:05.6186502Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T09:51:05.6187206Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:51:05.6187371Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:51:07.4417104Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:51:07.4422245Z #blocked = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:51:07.4422620Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:51:07.4422930Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:51:07.4423603Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:51:07.4423884Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:51:07.4424136Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:51:07.4424369Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:07.4424555Z #smem = #ttg.shared_memory 2026-02-21T09:51:07.4424784Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:51:07.4425402Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:07.4425789Z %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:51:07.4425957Z %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:51:07.4426125Z %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:51:07.4426292Z %cst_2 = arith.constant dense<8192> : tensor<1x128xi64, #mma> 2026-02-21T09:51:07.4426452Z %cst_3 = arith.constant dense<0> : tensor<1x128xi64, #mma> 2026-02-21T09:51:07.4430681Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4430858Z %cst_5 = arith.constant 
dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4431029Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4431203Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:07.4431374Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:07.4431555Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:51:07.4431713Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:51:07.4431829Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:51:07.4431943Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:51:07.4432128Z %cst_10 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:07.4432377Z %cst_11 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4432778Z %cst_12 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4432988Z %cst_13 = arith.constant dense<0> : tensor<1x128xi64, #blocked> 2026-02-21T09:51:07.4433163Z %cst_14 = arith.constant dense<8192> : tensor<1x128xi64, #blocked> 2026-02-21T09:51:07.4433345Z %cst_15 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:51:07.4433495Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:07.4433610Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:07.4433725Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:51:07.4433866Z %cst_16 = arith.constant dense<0> : tensor<2x128xi8, #blocked> 2026-02-21T09:51:07.4434040Z %cst_17 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4434185Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:51:07.4434299Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:07.4434479Z %cst_18 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4434693Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:07.4434807Z %1 = arith.divsi 
%0, %c512_i32 : i32 2026-02-21T09:51:07.4434922Z %2 = arith.muli %1, %c8_i32 : i32 2026-02-21T09:51:07.4435035Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:51:07.4435149Z %4 = arith.minsi %3, %c8_i32 : i32 2026-02-21T09:51:07.4435258Z %5 = arith.remsi %0, %c512_i32 : i32 2026-02-21T09:51:07.4435375Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:51:07.4435511Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:51:07.4435619Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:51:07.4435725Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:51:07.4435930Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:07.4436209Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:07.4436482Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:07.4436752Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:07.4437019Z %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:07.4437245Z %15 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:07.4437418Z %16 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:51:07.4437609Z %17 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:07.4437913Z %18 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:51:07.4438165Z %19 = arith.muli %18, %cst_15 : tensor<128x1xi32, #blocked2> 2026-02-21T09:51:07.4438376Z %20 = tt.broadcast %19 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4438593Z %21 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 
2026-02-21T09:51:07.4438755Z %22 = arith.extsi %16 : i32 to i64 2026-02-21T09:51:07.4438906Z %23 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:51:07.4439135Z %24 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4439449Z %25 = arith.extsi %24 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4439732Z %26 = tt.splat %22 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:07.4440025Z %27 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:07.4440317Z %28 = arith.addi %26, %27 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:07.4440604Z %29 = tt.expand_dims %28 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked> 2026-02-21T09:51:07.4440878Z %30 = tt.broadcast %29 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4441078Z %31 = arith.cmpi sge, %29, %cst_13 : tensor<1x128xi64, #blocked> 2026-02-21T09:51:07.4441248Z %32 = arith.cmpi slt, %29, %cst_14 : tensor<1x128xi64, #blocked> 2026-02-21T09:51:07.4441412Z %33 = arith.andi %31, %32 : tensor<1x128xi1, #blocked> 2026-02-21T09:51:07.4441591Z %34 = tt.broadcast %33 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:51:07.4441882Z %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:51:07.4442308Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:51:07.4442784Z %37 = tt.expand_dims %36 {axis = 2 : i32} : 
tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:07.4443044Z %38 = arith.cmpi eq, %37, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:07.4443240Z %39 = tt.broadcast %38 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:51:07.4443456Z %40 = arith.cmpi eq, %37, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:07.4443646Z %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:51:07.4443864Z %42 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:07.4444132Z %43 = tt.expand_dims %17 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:51:07.4444401Z %44 = tt.broadcast %43 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4444592Z %45 = arith.addi %20, %44 : tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4444805Z %46 = tt.addptr %21, %45 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4445008Z %47 = tt.load %46 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:51:07.4445294Z %48 = ttg.memdesc_index %42[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:51:07.4445651Z ttg.local_store %47, %48 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:51:07.4445925Z %49 = arith.addi %17, %cst_10 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:07.4446198Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:51:07.4446483Z %51 = tt.broadcast %50 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4446674Z %52 = arith.addi %20, %51 : tensor<128x4xi32, #blocked2> 
2026-02-21T09:51:07.4446867Z %53 = tt.addptr %21, %52 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4447067Z %54 = tt.load %53 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:51:07.4447343Z %55 = ttg.memdesc_index %42[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:51:07.4447696Z ttg.local_store %54, %55 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:51:07.4448236Z %56:4 = scf.for %arg3 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg4 = %cst_9, %arg5 = %c1_i32, %arg6 = %48, %arg7 = %55) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:51:07.4448655Z %140 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:51:07.4448782Z %141 = arith.muli %140, %c2_i32 : i32 2026-02-21T09:51:07.4448953Z %142 = tt.splat %141 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:07.4449176Z %143 = arith.addi %142, %17 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:07.4449453Z %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:51:07.4449730Z %145 = tt.broadcast %144 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4449929Z %146 = arith.addi %20, %145 : tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4450133Z %147 = tt.addptr %21, %146 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:51:07.4450344Z %148 = tt.load %147 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:51:07.4450650Z %149 = ttg.local_load %arg6 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:51:07.4451086Z %150 = arith.extf %149 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4451371Z %151 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:51:07.4451557Z %152 = tt.splat %151 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4451776Z %153 = arith.addi %152, %25 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4452050Z %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4452291Z %155 = arith.muli %154, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4452482Z %156 = tt.broadcast %155 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4452674Z %157 = arith.addi %156, %30 : tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4452885Z %158 = tt.addptr %23, %157 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4453100Z %159 = arith.cmpi sge, %154, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4453269Z %160 = arith.cmpi slt, %154, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4453434Z %161 = arith.andi %159, %160 : tensor<2x1xi1, #blocked> 2026-02-21T09:51:07.4453617Z %162 = tt.broadcast %161 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:51:07.4453805Z %163 = arith.andi %162, %34 : tensor<2x128xi1, #blocked> 2026-02-21T09:51:07.4453975Z %164 = tt.load %158, %163, %cst_16 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:51:07.4454260Z %165 = ttg.convert_layout %164 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4454548Z %166 = arith.shli %165, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4454793Z %167 = arith.shrsi %166, %cst_18 : tensor<2x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4455037Z %168 = arith.shrsi %165, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4455337Z %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:51:07.4455679Z %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:51:07.4455972Z %171 = tt.broadcast %169 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4456219Z %172 = arith.select %39, %171, %cst_17 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4456484Z %173 = tt.broadcast %170 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4456728Z %174 = arith.select %41, %173, %172 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4456962Z %175 = tt.reshape %174 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:51:07.4457190Z %176 = arith.sitofp %175 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:51:07.4457446Z %177 = ttg.local_alloc %176 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:07.4457774Z %178 = ttg.local_load %177 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4458258Z %179 = tt.dot %150, %178, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:07.4458608Z %180 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:51:07.4458738Z %181 = arith.cmpi slt, %180, %c2_i32 : i32 2026-02-21T09:51:07.4458870Z %182 = arith.select %181, %180, 
%c0_i32 : i32 2026-02-21T09:51:07.4459139Z %183 = ttg.memdesc_index %42[%182] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:51:07.4459517Z ttg.local_store %148, %183 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:51:07.4459916Z scf.yield %179, %182, %arg7, %183 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:51:07.4460223Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:51:07.4460501Z %57 = ttg.local_load %56#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4460948Z %58 = arith.extf %57 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4461306Z %59 = arith.addi %25, %cst_11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4461582Z %60 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4461826Z %61 = arith.muli %60, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4462013Z %62 = tt.broadcast %61 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4462199Z %63 = arith.addi %62, %30 : tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4462423Z %64 = tt.addptr %23, %63 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4462624Z %65 = arith.cmpi sge, %60, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4462791Z %66 = arith.cmpi slt, %60, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4462949Z %67 = arith.andi %65, %66 : tensor<2x1xi1, #blocked> 2026-02-21T09:51:07.4463126Z %68 = 
tt.broadcast %67 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:51:07.4463308Z %69 = arith.andi %68, %34 : tensor<2x128xi1, #blocked> 2026-02-21T09:51:07.4463467Z %70 = tt.load %64, %69, %cst_16 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:51:07.4463721Z %71 = ttg.convert_layout %70 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4463999Z %72 = arith.shli %71, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4464236Z %73 = arith.shrsi %72, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4464495Z %74 = arith.shrsi %71, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4464781Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:51:07.4465120Z %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:51:07.4465401Z %77 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4465643Z %78 = arith.select %39, %77, %cst_17 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4465881Z %79 = tt.broadcast %76 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4466110Z %80 = arith.select %41, %79, %78 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4466341Z %81 = tt.reshape %80 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:51:07.4466558Z %82 = arith.sitofp %81 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:51:07.4466811Z %83 = ttg.local_alloc %82 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:07.4467134Z %84 = ttg.local_load 
%83 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4467611Z %85 = tt.dot %58, %84, %56#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:07.4468099Z %86 = ttg.local_load %56#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4468526Z %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4468850Z %88 = arith.addi %25, %cst_12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:07.4469143Z %89 = tt.expand_dims %88 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4469381Z %90 = arith.muli %89, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4469564Z %91 = tt.broadcast %90 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4469749Z %92 = arith.addi %91, %30 : tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4469937Z %93 = tt.addptr %23, %92 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:51:07.4470137Z %94 = arith.cmpi sge, %89, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4470298Z %95 = arith.cmpi slt, %89, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:51:07.4470471Z %96 = arith.andi %94, %95 : tensor<2x1xi1, #blocked> 2026-02-21T09:51:07.4470645Z %97 = tt.broadcast %96 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:51:07.4470826Z %98 = arith.andi %97, %34 : tensor<2x128xi1, #blocked> 2026-02-21T09:51:07.4470987Z %99 = tt.load %93, %98, %cst_16 : tensor<2x128x!tt.ptr, #blocked> 
2026-02-21T09:51:07.4471239Z %100 = ttg.convert_layout %99 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4471525Z %101 = arith.shli %100, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4471767Z %102 = arith.shrsi %101, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4472013Z %103 = arith.shrsi %100, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:07.4472331Z %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:51:07.4472674Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:51:07.4472976Z %106 = tt.broadcast %104 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4473228Z %107 = arith.select %39, %106, %cst_17 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4473478Z %108 = tt.broadcast %105 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4473723Z %109 = arith.select %41, %108, %107 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:51:07.4473961Z %110 = tt.reshape %109 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:51:07.4474195Z %111 = arith.sitofp %110 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:51:07.4474456Z %112 = ttg.local_alloc %111 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:51:07.4474781Z %113 = ttg.local_load %112 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:07.4475254Z %114 = tt.dot %87, %113, %85, inputPrecision = tf32 : tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:51:07.4475652Z ttg.local_dealloc %42 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:07.4475875Z %115 = arith.truncf %114 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:51:07.4476054Z %116 = arith.extsi %9 : i32 to i64 2026-02-21T09:51:07.4476215Z %117 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:07.4476428Z %118 = tt.splat %116 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:07.4476709Z %119 = arith.extsi %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:07.4477065Z %120 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:07.4477349Z %121 = arith.addi %118, %119 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:07.4477621Z %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:51:07.4477866Z %123 = arith.muli %122, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:51:07.4478049Z %124 = tt.broadcast %123 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:07.4478263Z %125 = tt.splat %22 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:07.4478493Z %126 = arith.addi %125, %120 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:07.4478761Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:51:07.4479031Z %128 = tt.broadcast %127 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:51:07.4479217Z %129 = arith.addi %124, %128 : 
tensor<128x128xi64, #mma> 2026-02-21T09:51:07.4479417Z %130 = tt.addptr %117, %129 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:51:07.4479628Z %131 = arith.cmpi sge, %122, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:51:07.4479798Z %132 = arith.cmpi slt, %122, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:51:07.4479960Z %133 = arith.andi %131, %132 : tensor<128x1xi1, #mma> 2026-02-21T09:51:07.4480136Z %134 = tt.broadcast %133 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:07.4480330Z %135 = arith.cmpi sge, %127, %cst_3 : tensor<1x128xi64, #mma> 2026-02-21T09:51:07.4480513Z %136 = arith.cmpi slt, %127, %cst_2 : tensor<1x128xi64, #mma> 2026-02-21T09:51:07.4480677Z %137 = arith.andi %135, %136 : tensor<1x128xi1, #mma> 2026-02-21T09:51:07.4480855Z %138 = tt.broadcast %137 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:51:07.4481036Z %139 = arith.andi %134, %138 : tensor<128x128xi1, #mma> 2026-02-21T09:51:07.4481205Z tt.store %130, %115, %139 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:51:07.4481351Z tt.return 2026-02-21T09:51:07.4481439Z } 2026-02-21T09:51:07.4481519Z } 2026-02-21T09:51:07.4481568Z 2026-02-21T09:51:07.4481602Z {-# 2026-02-21T09:51:07.4481687Z external_resources: { 2026-02-21T09:51:07.4481793Z mlir_reproducer: { 2026-02-21T09:51:07.4482842Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:51:07.4483848Z 
disable_threading: false, 2026-02-21T09:51:07.4483975Z verify_each: true 2026-02-21T09:51:07.4484072Z } 2026-02-21T09:51:07.4484146Z } 2026-02-21T09:51:07.4484222Z #-} 2026-02-21T09:51:07.4484504Z /tmp/torchinductor_root/5w/c5wwxmx5vabbpkn6yll2cuxzqosrtw6kz5sxmqkbqh3xmwwb4n63.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:51:07.4485207Z /tmp/torchinductor_root/5w/c5wwxmx5vabbpkn6yll2cuxzqosrtw6kz5sxmqkbqh3xmwwb4n63.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:51:07.4485765Z [398s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:51:07.4486518Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:51:07.4489107Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:51:07.4489286Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:51:13.4609838Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 77/77 8.4 configs/s 2026-02-21T09:51:21.4486021Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 193/193 18.2 configs/s 2026-02-21T09:51:24.5592683Z [415s] Generation 4 complete: 2026-02-21T09:51:24.5593013Z error=4 2026-02-21T09:51:24.5593190Z ok=77 2026-02-21T09:51:24.5593335Z min=1.1042 2026-02-21T09:51:24.5593487Z mid=1.5332 2026-02-21T09:51:24.5593635Z max=67.4286 2026-02-21T09:51:24.5593809Z best={'block_sizes': [8, 128, 256], 2026-02-21T09:51:24.5594077Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:51:24.5594343Z 'l2_groupings': [2], 2026-02-21T09:51:24.5594551Z 'load_eviction_policies': ['', ''], 2026-02-21T09:51:24.5594772Z 'loop_orders': [[0, 1]], 2026-02-21T09:51:24.5594977Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:51:24.5595164Z 'num_stages': 1, 2026-02-21T09:51:24.5595375Z 'num_warps': 4, 2026-02-21T09:51:24.5595543Z 'pid_type': 'flat', 2026-02-21T09:51:24.5595732Z 'range_flattens': [None, None], 2026-02-21T09:51:24.5595949Z 'range_multi_buffers': [None, False], 2026-02-21T09:51:24.5596172Z 'range_num_stages': [0, 3], 2026-02-21T09:51:24.5596373Z 'range_unroll_factors': [0, 0], 2026-02-21T09:51:24.5597112Z 'range_warp_specializes': [], 2026-02-21T09:51:24.5597310Z 'waves_per_eu': 2} 2026-02-21T09:51:24.5717952Z [415s] Fitting surrogate: 506 points, 506 targets 2026-02-21T09:51:25.4006461Z [416s] Generation 5 starting: 80 neighbors, 4 active search path(s) 2026-02-21T09:51:38.9101625Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 1.7 configs/s 2026-02-21T09:51:45.1039603Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:51:45.1049481Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:51:45.1051696Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:45.1052150Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:45.1052470Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:45.1052958Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:51:45.1053368Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:51:45.1053917Z #smem = #ttg.shared_memory 2026-02-21T09:51:45.1054147Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:51:45.1054623Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:45.1055017Z %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1055199Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:45.1055465Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:45.1055654Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1055814Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:51:45.1056005Z %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1056256Z %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:51:45.1056623Z %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1056858Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1057031Z %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1057203Z %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1057375Z %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1057549Z %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1057726Z %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1057873Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:51:45.1057988Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:45.1058104Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:45.1058226Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:51:45.1058344Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:51:45.1058457Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:51:45.1058597Z %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2> 2026-02-21T09:51:45.1058769Z %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1058916Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:51:45.1059129Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:45.1059241Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:51:45.1059418Z %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1059604Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:45.1059716Z %1 = arith.muli %0, %c4_i32 : i32 2026-02-21T09:51:45.1059824Z %2 = arith.addi %1, %c4_i32 : i32 2026-02-21T09:51:45.1059941Z %3 = arith.minsi %2, %c32768_i32 : i32 2026-02-21T09:51:45.1060140Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1060414Z %5 = 
tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1060677Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1060941Z %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1061200Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1061439Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1061637Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1061866Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1062201Z %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1062561Z %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1062910Z %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:51:45.1063325Z %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:51:45.1063745Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:45.1063993Z %17 = arith.cmpi eq, %16, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:45.1064185Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:51:45.1064403Z %19 = arith.cmpi 
eq, %16, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:45.1064587Z %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:51:45.1064791Z %21 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:45.1064974Z %22 = arith.subi %3, %1 : i32 2026-02-21T09:51:45.1065090Z %23 = arith.remsi %22, %c3_i32 : i32 2026-02-21T09:51:45.1065203Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:51:45.1065316Z %25 = arith.addi %1, %24 : i32 2026-02-21T09:51:45.1065438Z scf.for %arg3 = %1 to %25 step %c3_i32 : i32 { 2026-02-21T09:51:45.1065575Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:51:45.1065697Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:51:45.1065814Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:51:45.1065933Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:51:45.1066049Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:51:45.1066173Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:51:45.1066301Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:51:45.1066412Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:51:45.1066522Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:51:45.1066701Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1066904Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1067067Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:51:45.1067233Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1067436Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1067646Z %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1067873Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1068136Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 
1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1068386Z %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1068574Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1068746Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:51:45.1068929Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1069145Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1069414Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1069725Z %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1069921Z %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1070091Z %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1070250Z %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:45.1070453Z %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1070662Z %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1070947Z %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1071210Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1071414Z %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1071609Z %58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1071822Z %59 = tt.load %58 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1072102Z %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> 
!ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1072495Z ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1072779Z %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1073073Z %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1073337Z %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1073569Z %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1073859Z %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1074141Z %66 = tt.load %65 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1074525Z %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1075002Z ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1075813Z %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:45.1076465Z %307 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:45.1076651Z %308 = arith.muli %307, %c2_i32 : i32 2026-02-21T09:51:45.1076914Z %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1077259Z %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1077687Z %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1078118Z %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1078415Z %313 = arith.addi %44, %312 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1078721Z %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1079034Z %315 = tt.load %314 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1079497Z %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1080107Z %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1080386Z %318 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:45.1080633Z %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1080913Z %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1081191Z %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1081458Z %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1081648Z %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1081843Z %324 = arith.addi %323, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1082039Z %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1082281Z %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1082544Z %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1082798Z %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2> 
2026-02-21T09:51:45.1082989Z %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1083176Z %330 = arith.andi %329, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1083347Z %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1083601Z %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1083896Z %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1084139Z %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1084380Z %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1084676Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1085036Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1085317Z %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1085556Z %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1085823Z %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1086150Z %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1086383Z %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1086602Z %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1086896Z %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, 
kWidth = 2}>> 2026-02-21T09:51:45.1087361Z %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1087710Z %346 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:45.1087863Z %347 = arith.cmpi slt, %346, %c2_i32 : i32 2026-02-21T09:51:45.1087997Z %348 = arith.select %347, %346, %c0_i32 : i32 2026-02-21T09:51:45.1088263Z %349 = ttg.memdesc_index %54[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1088610Z ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1089002Z scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1089352Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:45.1089661Z %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1090080Z %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1090420Z %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1090693Z %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1090936Z %73 = arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1091121Z %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 
2026-02-21T09:51:45.1091306Z %75 = arith.addi %74, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1091499Z %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1091702Z %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1091869Z %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1092027Z %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:45.1092209Z %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1092393Z %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1092554Z %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1092826Z %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1093097Z %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1093329Z %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1093556Z %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1093835Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1094163Z %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1094432Z %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1094661Z %90 = arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1094887Z %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1095107Z %92 = arith.select %20, %91, 
%90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1095326Z %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1095535Z %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1095833Z %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1096283Z %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1096767Z %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1097186Z %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1097528Z %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1097803Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1098053Z %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1098259Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1098453Z %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1098647Z %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1098858Z %105 = arith.cmpi sge, %100, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1099034Z %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1099201Z %107 = 
arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:45.1099387Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1099574Z %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1099744Z %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1100004Z %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1100281Z %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1100520Z %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1100777Z %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1101065Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1101398Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1101676Z %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1101918Z %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1102155Z %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1102384Z %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1102614Z %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1102831Z %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1103127Z %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> 
tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1103579Z %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1103974Z ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1104185Z %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:45.1104445Z %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1104679Z %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1104905Z %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:45.1105182Z %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1105382Z %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1105558Z %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1105745Z %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1105935Z tt.store %132, %125 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:45.1106097Z %133 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:51:45.1106219Z %134 = arith.divsi %133, %c512_i32 : i32 2026-02-21T09:51:45.1106341Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:51:45.1106461Z %136 = arith.subi %c128_i32, %135 : i32 2026-02-21T09:51:45.1106580Z %137 = arith.minsi %136, %c2_i32 : i32 2026-02-21T09:51:45.1106700Z %138 = arith.remsi %133, %c512_i32 : i32 2026-02-21T09:51:45.1106822Z %139 = arith.remsi %138, %137 : i32 2026-02-21T09:51:45.1106940Z %140 = arith.addi %135, %139 : i32 2026-02-21T09:51:45.1107052Z %141 = arith.divsi 
%138, %137 : i32 2026-02-21T09:51:45.1107168Z %142 = arith.muli %140, %c64_i32 : i32 2026-02-21T09:51:45.1107327Z %143 = tt.splat %142 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1107537Z %144 = arith.addi %143, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1107707Z %145 = arith.muli %141, %c64_i32 : i32 2026-02-21T09:51:45.1107875Z %146 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1108168Z %147 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1108420Z %148 = arith.addi %146, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1108757Z %149 = arith.addi %147, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1109049Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1109331Z %151 = arith.muli %150, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1109582Z %152 = tt.broadcast %151 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1109776Z %153 = arith.extsi %142 : i32 to i64 2026-02-21T09:51:45.1117994Z %154 = tt.splat %153 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1118222Z %155 = arith.addi %154, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1118506Z %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1118791Z %157 = tt.broadcast %156 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1118999Z %158 = arith.cmpi sge, %156, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1119179Z %159 = arith.cmpi slt, %156, %cst_10 : tensor<1x64xi64, #blocked2> 
2026-02-21T09:51:45.1119351Z %160 = arith.andi %158, %159 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:45.1119580Z %161 = tt.broadcast %160 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1119799Z %162 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1119985Z %163 = arith.addi %152, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1120184Z %164 = tt.addptr %9, %163 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1120390Z %165 = tt.load %164 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1120675Z %166 = ttg.memdesc_index %162[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1121052Z ttg.local_store %165, %166 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1121288Z %167 = arith.addi %152, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1121486Z %168 = tt.addptr %9, %167 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1121690Z %169 = tt.load %168 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1121982Z %170 = ttg.memdesc_index %162[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1122337Z ttg.local_store %169, %170 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1122920Z %171:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %166, %arg8 = %170) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:45.1123342Z %307 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:45.1123471Z %308 = arith.muli %307, %c2_i32 : i32 
2026-02-21T09:51:45.1123644Z %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1123872Z %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1124148Z %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1124429Z %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1124649Z %313 = arith.addi %152, %312 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1124846Z %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1125053Z %315 = tt.load %314 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1125347Z %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1125781Z %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1126062Z %318 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:45.1126234Z %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1126456Z %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1126731Z %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1126982Z %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1127175Z %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1127367Z %324 = arith.addi %323, %157 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1127583Z %325 = tt.addptr %10, %324 : 
tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1127791Z %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1127965Z %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1128132Z %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:45.1128319Z %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1128513Z %330 = arith.andi %329, %161 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1128681Z %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1128965Z %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1129242Z %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1129480Z %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1129736Z %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1130025Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1130360Z %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1130641Z %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1130881Z %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1131118Z %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1131353Z %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1131583Z %342 = 
tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1131809Z %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1132104Z %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1132591Z %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1132936Z %346 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:45.1133067Z %347 = arith.cmpi slt, %346, %c2_i32 : i32 2026-02-21T09:51:45.1133202Z %348 = arith.select %347, %346, %c0_i32 : i32 2026-02-21T09:51:45.1133465Z %349 = ttg.memdesc_index %162[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1133817Z ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1134202Z scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1134538Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:45.1134853Z %172 = ttg.local_load %171#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1135278Z %173 = arith.extf %172 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1135591Z %174 = arith.addi %74, %157 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1135792Z %175 = tt.addptr %10, %174 : 
tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1135992Z %176 = arith.andi %80, %161 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1136164Z %177 = tt.load %175, %176, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1136420Z %178 = ttg.convert_layout %177 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1136705Z %179 = arith.shli %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1136960Z %180 = arith.shrsi %179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1137196Z %181 = arith.shrsi %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1137487Z %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1137840Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1138121Z %184 = tt.broadcast %182 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1138361Z %185 = arith.select %18, %184, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1138598Z %186 = tt.broadcast %183 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1138833Z %187 = arith.select %20, %186, %185 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1139058Z %188 = tt.reshape %187 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1139280Z %189 = arith.sitofp %188 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1139575Z %190 = ttg.convert_layout %189 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1140041Z %191 = 
tt.dot %173, %190, %171#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1140554Z %192 = ttg.local_load %171#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1140977Z %193 = arith.extf %192 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1141275Z %194 = arith.addi %102, %157 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1141476Z %195 = tt.addptr %10, %194 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1141674Z %196 = arith.andi %108, %161 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1141846Z %197 = tt.load %195, %196, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1142102Z %198 = ttg.convert_layout %197 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1142384Z %199 = arith.shli %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1142621Z %200 = arith.shrsi %199, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1142855Z %201 = arith.shrsi %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1143144Z %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1143479Z %203 = tt.expand_dims %201 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1143774Z %204 = tt.broadcast %202 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1144011Z %205 = 
arith.select %18, %204, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1144245Z %206 = tt.broadcast %203 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1144478Z %207 = arith.select %20, %206, %205 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1144708Z %208 = tt.reshape %207 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1144943Z %209 = arith.sitofp %208 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1145235Z %210 = ttg.convert_layout %209 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1145712Z %211 = tt.dot %193, %210, %191, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1146097Z ttg.local_dealloc %162 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1146312Z %212 = arith.truncf %211 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:45.1146574Z %213 = tt.expand_dims %149 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1146813Z %214 = arith.muli %213, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1147040Z %215 = tt.expand_dims %144 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:45.1147296Z %216 = tt.broadcast %214 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1147498Z %217 = tt.broadcast %215 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1147675Z %218 = arith.addi %216, %217 : tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1147862Z %219 = tt.addptr %21, %218 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 
2026-02-21T09:51:45.1148054Z tt.store %219, %212 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:45.1148197Z %220 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:51:45.1148321Z %221 = arith.divsi %220, %c512_i32 : i32 2026-02-21T09:51:45.1148457Z %222 = arith.muli %221, %c2_i32 : i32 2026-02-21T09:51:45.1148579Z %223 = arith.subi %c128_i32, %222 : i32 2026-02-21T09:51:45.1148698Z %224 = arith.minsi %223, %c2_i32 : i32 2026-02-21T09:51:45.1148817Z %225 = arith.remsi %220, %c512_i32 : i32 2026-02-21T09:51:45.1148934Z %226 = arith.remsi %225, %224 : i32 2026-02-21T09:51:45.1149052Z %227 = arith.addi %222, %226 : i32 2026-02-21T09:51:45.1149165Z %228 = arith.divsi %225, %224 : i32 2026-02-21T09:51:45.1149283Z %229 = arith.muli %227, %c64_i32 : i32 2026-02-21T09:51:45.1149443Z %230 = tt.splat %229 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1149654Z %231 = arith.addi %230, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1149821Z %232 = arith.muli %228, %c64_i32 : i32 2026-02-21T09:51:45.1149987Z %233 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1150204Z %234 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1150418Z %235 = arith.addi %233, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1150633Z %236 = arith.addi %234, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1150903Z %237 = tt.expand_dims %235 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1151172Z %238 = arith.muli %237, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1151371Z %239 = tt.broadcast %238 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1151544Z %240 = arith.extsi %229 : i32 to i64 2026-02-21T09:51:45.1151716Z %241 = tt.splat %240 : i64 -> 
tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1151941Z %242 = arith.addi %241, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1152218Z %243 = tt.expand_dims %242 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1152518Z %244 = tt.broadcast %243 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1152720Z %245 = arith.cmpi sge, %243, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1152897Z %246 = arith.cmpi slt, %243, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1153066Z %247 = arith.andi %245, %246 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:45.1153269Z %248 = tt.broadcast %247 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1153485Z %249 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1153668Z %250 = arith.addi %239, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1153867Z %251 = tt.addptr %9, %250 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1154070Z %252 = tt.load %251 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1154353Z %253 = ttg.memdesc_index %249[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1154711Z ttg.local_store %252, %253 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1154949Z %254 = arith.addi %239, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1155149Z %255 = tt.addptr %9, %254 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1155349Z %256 = tt.load %255 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1155628Z %257 = ttg.memdesc_index %249[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> 
!ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1156001Z ttg.local_store %256, %257 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1156513Z %258:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %253, %arg8 = %257) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:45.1156933Z %307 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:45.1157060Z %308 = arith.muli %307, %c2_i32 : i32 2026-02-21T09:51:45.1157231Z %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1157459Z %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1157734Z %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1158017Z %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1158215Z %313 = arith.addi %239, %312 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1158414Z %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1158619Z %315 = tt.load %314 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1158927Z %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1159359Z %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1159641Z %318 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:45.1159814Z %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 
1, parent = #blocked2}>> 2026-02-21T09:51:45.1160037Z %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1160330Z %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1160578Z %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1160768Z %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1160961Z %324 = arith.addi %323, %244 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1161173Z %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1161378Z %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1161548Z %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1161710Z %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:45.1161895Z %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1162083Z %330 = arith.andi %329, %248 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1162250Z %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1162504Z %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1162835Z %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1163072Z %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1163305Z %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1163615Z %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1163946Z %337 = 
tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1164226Z %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1164460Z %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1164694Z %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1164924Z %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1165152Z %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1165371Z %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1165665Z %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1166127Z %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1166469Z %346 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:45.1166615Z %347 = arith.cmpi slt, %346, %c2_i32 : i32 2026-02-21T09:51:45.1166746Z %348 = arith.select %347, %346, %c0_i32 : i32 2026-02-21T09:51:45.1167012Z %349 = ttg.memdesc_index %249[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1167361Z ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1167745Z scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 
2x64x4> 2026-02-21T09:51:45.1168097Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:45.1168405Z %259 = ttg.local_load %258#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1168829Z %260 = arith.extf %259 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1169140Z %261 = arith.addi %74, %244 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1169336Z %262 = tt.addptr %10, %261 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1169533Z %263 = arith.andi %80, %248 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1169698Z %264 = tt.load %262, %263, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1169951Z %265 = ttg.convert_layout %264 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1170230Z %266 = arith.shli %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1170461Z %267 = arith.shrsi %266, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1170695Z %268 = arith.shrsi %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1170980Z %269 = tt.expand_dims %267 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1171309Z %270 = tt.expand_dims %268 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1171603Z %271 = tt.broadcast %269 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1171835Z %272 = arith.select %18, %271, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T09:51:45.1172073Z %273 = tt.broadcast %270 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1172302Z %274 = arith.select %20, %273, %272 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1172531Z %275 = tt.reshape %274 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1172749Z %276 = arith.sitofp %275 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1173038Z %277 = ttg.convert_layout %276 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1173497Z %278 = tt.dot %260, %277, %258#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1173985Z %279 = ttg.local_load %258#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1174406Z %280 = arith.extf %279 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1174712Z %281 = arith.addi %102, %244 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1174906Z %282 = tt.addptr %10, %281 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1175106Z %283 = arith.andi %108, %248 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1175271Z %284 = tt.load %282, %283, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1175524Z %285 = ttg.convert_layout %284 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1175801Z %286 = arith.shli %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1176055Z %287 = arith.shrsi %286, %cst_14 : 
tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1176289Z %288 = arith.shrsi %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1176573Z %289 = tt.expand_dims %287 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1176923Z %290 = tt.expand_dims %288 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1177199Z %291 = tt.broadcast %289 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1177431Z %292 = arith.select %18, %291, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1177664Z %293 = tt.broadcast %290 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1177891Z %294 = arith.select %20, %293, %292 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1178115Z %295 = tt.reshape %294 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1178331Z %296 = arith.sitofp %295 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1178620Z %297 = ttg.convert_layout %296 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1179075Z %298 = tt.dot %280, %297, %278, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1179453Z ttg.local_dealloc %249 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1179682Z %299 = arith.truncf %298 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:45.1179943Z %300 = tt.expand_dims %236 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 
2026-02-21T09:51:45.1180176Z %301 = arith.muli %300, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1180398Z %302 = tt.expand_dims %231 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:45.1180653Z %303 = tt.broadcast %301 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1180849Z %304 = tt.broadcast %302 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1181024Z %305 = arith.addi %303, %304 : tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1181205Z %306 = tt.addptr %21, %305 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1181398Z tt.store %306, %299 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:45.1181537Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:45.1181654Z scf.for %arg3 = %25 to %3 step %c1_i32 : i32 { 2026-02-21T09:51:45.1181788Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:51:45.1181906Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:51:45.1182022Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:51:45.1182135Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:51:45.1182265Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:51:45.1182382Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:51:45.1182491Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:51:45.1182600Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:51:45.1182709Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:51:45.1182863Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1183063Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:45.1183224Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:51:45.1183383Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1183610Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1183816Z %40 = 
arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:45.1184018Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:45.1184278Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1184532Z %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:45.1184720Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1184888Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:51:45.1185047Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1185260Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:45.1185530Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1185798Z %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1185992Z %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1186158Z %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:45.1186317Z %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:45.1186492Z %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1186700Z %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1186983Z %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1187248Z %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1187431Z %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1187619Z 
%58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1187816Z %59 = tt.load %58 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1188092Z %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1188440Z ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1188707Z %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1188976Z %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1189243Z %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1189425Z %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1189613Z %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1189823Z %66 = tt.load %65 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1190096Z %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1190439Z ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1190946Z %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:45.1191375Z %133 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:45.1191499Z %134 = arith.muli %133, %c2_i32 : i32 2026-02-21T09:51:45.1191669Z %135 = tt.splat %134 : i32 -> 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1191893Z %136 = arith.addi %135, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:45.1192181Z %137 = tt.expand_dims %136 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:45.1192454Z %138 = tt.broadcast %137 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1192645Z %139 = arith.addi %44, %138 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1192840Z %140 = tt.addptr %9, %139 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:45.1193044Z %141 = tt.load %140 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:45.1193338Z %142 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1193767Z %143 = arith.extf %142 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1194047Z %144 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:45.1194217Z %145 = tt.splat %144 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1194437Z %146 = arith.addi %145, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1194730Z %147 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1194976Z %148 = arith.muli %147, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1195164Z %149 = tt.broadcast %148 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1195352Z %150 = arith.addi %149, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1195548Z %151 = tt.addptr %10, %150 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 
2026-02-21T09:51:45.1195755Z %152 = arith.cmpi sge, %147, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1195926Z %153 = arith.cmpi slt, %147, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1196087Z %154 = arith.andi %152, %153 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:45.1196271Z %155 = tt.broadcast %154 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1196457Z %156 = arith.andi %155, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1196622Z %157 = tt.load %151, %156, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1196881Z %158 = ttg.convert_layout %157 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1197161Z %159 = arith.shli %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1197409Z %160 = arith.shrsi %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1197645Z %161 = arith.shrsi %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1197932Z %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1198278Z %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1198556Z %164 = tt.broadcast %162 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1198813Z %165 = arith.select %18, %164, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1199046Z %166 = tt.broadcast %163 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1199275Z %167 = arith.select %20, %166, %165 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1199500Z %168 = tt.reshape %167 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, 
#blocked3> 2026-02-21T09:51:45.1199734Z %169 = arith.sitofp %168 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1200025Z %170 = ttg.convert_layout %169 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1200486Z %171 = tt.dot %143, %170, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1200831Z %172 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:45.1200957Z %173 = arith.cmpi slt, %172, %c2_i32 : i32 2026-02-21T09:51:45.1201086Z %174 = arith.select %173, %172, %c0_i32 : i32 2026-02-21T09:51:45.1201347Z %175 = ttg.memdesc_index %54[%174] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1201694Z ttg.local_store %141, %175 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1202077Z scf.yield %171, %174, %arg8, %175 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:45.1202425Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:45.1202781Z %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1203205Z %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1203529Z %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1203804Z %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim 
= 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1204043Z %73 = arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1204228Z %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1204414Z %75 = arith.addi %74, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1204599Z %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1204800Z %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1204966Z %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1205121Z %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:45.1205324Z %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1205504Z %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1205663Z %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1205909Z %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1206180Z %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1206406Z %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1206651Z %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1206928Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1207251Z %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1207536Z %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1207767Z %90 = 
arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1207992Z %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1208213Z %92 = arith.select %20, %91, %90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1208431Z %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1208642Z %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1208926Z %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1209375Z %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1209855Z %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1210290Z %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1210615Z %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:45.1210894Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1211142Z %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1211333Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1211526Z %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1211723Z %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr, 
#blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:45.1211933Z %105 = arith.cmpi sge, %100, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1212102Z %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:45.1212268Z %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:45.1212453Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1212638Z %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:45.1212805Z %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:45.1213079Z %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1213360Z %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1213596Z %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1213832Z %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:45.1214119Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1214468Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:45.1214747Z %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1214985Z %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1215221Z %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1215468Z %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:45.1215692Z %121 = tt.reshape %120 : 
tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:45.1215912Z %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:45.1216207Z %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:45.1216664Z %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:45.1217043Z ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:45.1217256Z %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:45.1217517Z %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1217754Z %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:45.1217998Z %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:45.1218251Z %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1218449Z %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1218629Z %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1218820Z %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:45.1219014Z tt.store %132, %125 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:45.1219158Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:45.1219263Z tt.return 2026-02-21T09:51:45.1219350Z } 2026-02-21T09:51:45.1219431Z } 2026-02-21T09:51:45.1219481Z 2026-02-21T09:51:45.1219514Z {-# 2026-02-21T09:51:45.1219601Z external_resources: { 2026-02-21T09:51:45.1219704Z mlir_reproducer: { 
2026-02-21T09:51:45.1220718Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:45.1221732Z disable_threading: false,
2026-02-21T09:51:45.1221841Z verify_each: true
2026-02-21T09:51:45.1221939Z }
2026-02-21T09:51:45.1222013Z }
2026-02-21T09:51:45.1222092Z #-}
2026-02-21T09:51:45.1222374Z /tmp/torchinductor_root/lk/clkaq5pwstq5v3fgr6xdujxp46fgalgtiguqham6cdpakbbljfvv.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:45.1223063Z /tmp/torchinductor_root/lk/clkaq5pwstq5v3fgr6xdujxp46fgalgtiguqham6cdpakbbljfvv.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:45.1223649Z [435s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:45.1224443Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:51:45.1225144Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:45.1225320Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:46.2056911Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:46.2062048Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:51:46.2063289Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:51:46.2064153Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:51:46.2065115Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:51:46.2065892Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:46.2066974Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:46.2067364Z #smem = #ttg.shared_memory
2026-02-21T09:51:46.2067814Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:46.2068798Z tt.func public
@_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:46.2069731Z %cst = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2070214Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:51:46.2070523Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:46.2070786Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:51:46.2071132Z %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2071513Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:51:46.2071755Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:51:46.2071986Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:51:46.2072215Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:46.2072491Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:46.2072842Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:51:46.2073297Z %cst_1 = arith.constant dense<0> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2073868Z %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2074867Z %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2075675Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2076061Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2076571Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2076961Z %cst_7 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2077630Z %cst_8 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2078081Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, 
#blocked> 2026-02-21T09:51:46.2078328Z %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.2078592Z %cst_11 = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2078878Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:46.2079185Z %1 = arith.muli %0, %c4_i32 : i32 2026-02-21T09:51:46.2079495Z %2 = arith.addi %1, %c4_i32 : i32 2026-02-21T09:51:46.2079705Z %3 = arith.minsi %2, %c32768_i32 : i32 2026-02-21T09:51:46.2080000Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2080416Z %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2080959Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2081389Z %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2081990Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2082537Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.2082982Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2083498Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2084322Z %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2085032Z %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = 
#ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2085676Z %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:51:46.2086241Z %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:51:46.2086688Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.2086978Z %17 = arith.cmpi eq, %16, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.2087227Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:51:46.2087506Z %19 = arith.cmpi eq, %16, %cst_10 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.2087853Z %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:51:46.2088310Z %21 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.2088761Z %22 = arith.subi %3, %1 : i32 2026-02-21T09:51:46.2088959Z %23 = arith.remsi %22, %c3_i32 : i32 2026-02-21T09:51:46.2089270Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:51:46.2089623Z %25 = arith.addi %1, %24 : i32 2026-02-21T09:51:46.2090033Z scf.for %arg3 = %1 to %25 step %c3_i32 : i32 { 2026-02-21T09:51:46.2090308Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:51:46.2090500Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:51:46.2090727Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:51:46.2090950Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:51:46.2091211Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:51:46.2091372Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:51:46.2091637Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:51:46.2091882Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:51:46.2092117Z %34 = arith.muli %32, 
%c64_i32 : i32 2026-02-21T09:51:46.2092304Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2092535Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2092795Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:51:46.2093104Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2093446Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2093692Z %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2093986Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2094368Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2094689Z %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2094923Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2095167Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:51:46.2095494Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2095941Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2096426Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2096956Z %49 = tt.broadcast %48 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2097432Z %50 = arith.cmpi sge, %48, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:46.2097773Z %51 = arith.cmpi slt, %48, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2098123Z %52 = arith.andi %50, %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2098470Z %53 = tt.broadcast %52 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2098853Z %54 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>) : i32 { 2026-02-21T09:51:46.2099103Z %139 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:46.2099401Z %140 = tt.splat %139 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2099753Z %141 = arith.addi %140, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2100065Z %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.2100470Z %143 = tt.broadcast %142 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2100706Z %144 = arith.addi %44, %143 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2100909Z %145 = tt.addptr %9, %144 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2101146Z %146 = tt.load %145 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.2101421Z %147 = ttg.local_alloc %146 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T09:51:46.2101862Z %148 = ttg.local_load %147 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2102411Z %149 = arith.extf %148 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2102696Z %150 = arith.extsi %arg4 : i32 to i64 
2026-02-21T09:51:46.2103008Z %151 = tt.splat %150 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2103323Z %152 = arith.addi %151, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2103710Z %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2104068Z %154 = arith.muli %153, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2104377Z %155 = tt.broadcast %154 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2104678Z %156 = arith.addi %155, %49 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2104993Z %157 = tt.addptr %10, %156 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2105311Z %158 = arith.cmpi sge, %153, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2105553Z %159 = arith.cmpi slt, %153, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2105804Z %160 = arith.andi %158, %159 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2106102Z %161 = tt.broadcast %160 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2106400Z %162 = arith.andi %161, %53 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2106643Z %163 = tt.load %157, %162, %cst_1 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2106890Z %164 = arith.shli %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:51:46.2107166Z %165 = arith.shrsi %164, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2107408Z %166 = arith.shrsi %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2107709Z %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2108048Z %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2108342Z %169 = tt.broadcast %167 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2108595Z %170 = arith.select %18, %169, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2108845Z %171 = tt.broadcast %168 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2109076Z %172 = arith.select %20, %171, %170 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2109303Z %173 = tt.reshape %172 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:51:46.2109522Z %174 = arith.sitofp %173 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:51:46.2109774Z %175 = ttg.local_alloc %174 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:51:46.2110098Z %176 = ttg.local_load %175 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2110598Z %177 = tt.dot %149, %176, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2110952Z scf.yield %177 : tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2111115Z } {tt.flatten, tt.loop_unroll_factor = 1 : 
i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.2111335Z %55 = arith.truncf %54 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.2111590Z %56 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2111822Z %57 = arith.muli %56, %cst_11 : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2112047Z %58 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.2112294Z %59 = tt.broadcast %57 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2112498Z %60 = tt.broadcast %58 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2112682Z %61 = arith.addi %59, %60 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2112861Z %62 = tt.addptr %21, %61 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2113051Z tt.store %62, %55 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.2113186Z %63 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:51:46.2113307Z %64 = arith.divsi %63, %c512_i32 : i32 2026-02-21T09:51:46.2113422Z %65 = arith.muli %64, %c2_i32 : i32 2026-02-21T09:51:46.2113539Z %66 = arith.subi %c128_i32, %65 : i32 2026-02-21T09:51:46.2113670Z %67 = arith.minsi %66, %c2_i32 : i32 2026-02-21T09:51:46.2113785Z %68 = arith.remsi %63, %c512_i32 : i32 2026-02-21T09:51:46.2113903Z %69 = arith.remsi %68, %67 : i32 2026-02-21T09:51:46.2114015Z %70 = arith.addi %65, %69 : i32 2026-02-21T09:51:46.2114141Z %71 = arith.divsi %68, %67 : i32 2026-02-21T09:51:46.2114251Z %72 = arith.muli %70, %c64_i32 : i32 2026-02-21T09:51:46.2114464Z %73 = tt.splat %72 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2114741Z %74 = arith.addi %73, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2114904Z %75 = arith.muli %71, %c64_i32 : i32 2026-02-21T09:51:46.2115081Z %76 = tt.splat %75 : i32 -> 
tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2115309Z %77 = tt.splat %75 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2115524Z %78 = arith.addi %76, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2115731Z %79 = arith.addi %77, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2115999Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2116250Z %81 = arith.muli %80, %cst_7 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2116450Z %82 = tt.broadcast %81 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2116622Z %83 = arith.extsi %72 : i32 to i64 2026-02-21T09:51:46.2116822Z %84 = tt.splat %83 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2117131Z %85 = arith.addi %84, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2117515Z %86 = tt.expand_dims %85 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2117954Z %87 = tt.broadcast %86 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2118262Z %88 = arith.cmpi sge, %86, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2118501Z %89 = arith.cmpi slt, %86, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2118745Z %90 = arith.andi %88, %89 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2119164Z %91 = tt.broadcast %90 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2119493Z %92 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>) : i32 { 2026-02-21T09:51:46.2119704Z %139 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:46.2119876Z %140 = tt.splat %139 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2120102Z %141 = arith.addi %140, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2120380Z %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.2120666Z %143 = tt.broadcast %142 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2120886Z %144 = arith.addi %82, %143 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2121093Z %145 = tt.addptr %9, %144 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2121472Z %146 = tt.load %145 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.2121713Z %147 = ttg.local_alloc %146 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T09:51:46.2122039Z %148 = ttg.local_load %147 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2122448Z %149 = arith.extf %148 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2122990Z %150 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:46.2123271Z %151 = tt.splat %150 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2123571Z %152 = arith.addi %151, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2123976Z %153 = tt.expand_dims %152 {axis = 1 : 
i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2124330Z %154 = arith.muli %153, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2124640Z %155 = tt.broadcast %154 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2124939Z %156 = arith.addi %155, %87 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2125276Z %157 = tt.addptr %10, %156 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2125591Z %158 = arith.cmpi sge, %153, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2125836Z %159 = arith.cmpi slt, %153, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2126082Z %160 = arith.andi %158, %159 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2126397Z %161 = tt.broadcast %160 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2126733Z %162 = arith.andi %161, %91 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2127076Z %163 = tt.load %157, %162, %cst_1 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2127321Z %164 = arith.shli %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2127578Z %165 = arith.shrsi %164, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2127810Z %166 = arith.shrsi %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2128097Z %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<2x64xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2128596Z %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2128905Z %169 = tt.broadcast %167 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2129143Z %170 = arith.select %18, %169, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2129376Z %171 = tt.broadcast %168 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2129608Z %172 = arith.select %20, %171, %170 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2129839Z %173 = tt.reshape %172 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:51:46.2130057Z %174 = arith.sitofp %173 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:51:46.2130330Z %175 = ttg.local_alloc %174 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:51:46.2130649Z %176 = ttg.local_load %175 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2131119Z %177 = tt.dot %149, %176, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2131469Z scf.yield %177 : tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2131651Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.2131855Z %93 = arith.truncf %92 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.2132110Z %94 = tt.expand_dims %79 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2132344Z %95 = arith.muli %94, %cst_11 : 
tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2132625Z %96 = tt.expand_dims %74 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.2132871Z %97 = tt.broadcast %95 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2133092Z %98 = tt.broadcast %96 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2133262Z %99 = arith.addi %97, %98 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2133457Z %100 = tt.addptr %21, %99 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2133649Z tt.store %100, %93 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.2133786Z %101 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:51:46.2133911Z %102 = arith.divsi %101, %c512_i32 : i32 2026-02-21T09:51:46.2134031Z %103 = arith.muli %102, %c2_i32 : i32 2026-02-21T09:51:46.2134172Z %104 = arith.subi %c128_i32, %103 : i32 2026-02-21T09:51:46.2134290Z %105 = arith.minsi %104, %c2_i32 : i32 2026-02-21T09:51:46.2134421Z %106 = arith.remsi %101, %c512_i32 : i32 2026-02-21T09:51:46.2134567Z %107 = arith.remsi %106, %105 : i32 2026-02-21T09:51:46.2134699Z %108 = arith.addi %103, %107 : i32 2026-02-21T09:51:46.2134816Z %109 = arith.divsi %106, %105 : i32 2026-02-21T09:51:46.2134931Z %110 = arith.muli %108, %c64_i32 : i32 2026-02-21T09:51:46.2135092Z %111 = tt.splat %110 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2135305Z %112 = arith.addi %111, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2135473Z %113 = arith.muli %109, %c64_i32 : i32 2026-02-21T09:51:46.2135652Z %114 = tt.splat %113 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2135867Z %115 = tt.splat %113 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2136082Z %116 = arith.addi %114, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2136292Z %117 = 
arith.addi %115, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2136564Z %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2136815Z %119 = arith.muli %118, %cst_7 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2137014Z %120 = tt.broadcast %119 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2137196Z %121 = arith.extsi %110 : i32 to i64 2026-02-21T09:51:46.2137408Z %122 = tt.splat %121 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2137708Z %123 = arith.addi %122, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2138112Z %124 = tt.expand_dims %123 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2138570Z %125 = tt.broadcast %124 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2138884Z %126 = arith.cmpi sge, %124, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2139128Z %127 = arith.cmpi slt, %124, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2139363Z %128 = arith.andi %126, %127 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2139666Z %129 = tt.broadcast %128 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2140001Z %130 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>) : i32 { 2026-02-21T09:51:46.2140213Z %139 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:46.2140386Z 
%140 = tt.splat %139 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2140614Z %141 = arith.addi %140, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2140891Z %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.2141219Z %143 = tt.broadcast %142 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2141417Z %144 = arith.addi %120, %143 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2141617Z %145 = tt.addptr %9, %144 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2141824Z %146 = tt.load %145 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.2142049Z %147 = ttg.local_alloc %146 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T09:51:46.2142375Z %148 = ttg.local_load %147 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2142796Z %149 = arith.extf %148 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2143076Z %150 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:46.2143288Z %151 = tt.splat %150 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2143609Z %152 = arith.addi %151, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2143996Z %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2144353Z %154 = arith.muli %153, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:46.2144658Z %155 = tt.broadcast %154 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2144983Z %156 = arith.addi %155, %125 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2145294Z %157 = tt.addptr %10, %156 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2145610Z %158 = arith.cmpi sge, %153, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2145877Z %159 = arith.cmpi slt, %153, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2146130Z %160 = arith.andi %158, %159 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2146494Z %161 = tt.broadcast %160 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2146795Z %162 = arith.andi %161, %129 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2147036Z %163 = tt.load %157, %162, %cst_1 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2147283Z %164 = arith.shli %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2147519Z %165 = arith.shrsi %164, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2147752Z %166 = arith.shrsi %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2148040Z %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2148372Z %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2148658Z %169 = 
tt.broadcast %167 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2148896Z %170 = arith.select %18, %169, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2149146Z %171 = tt.broadcast %168 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2149377Z %172 = arith.select %20, %171, %170 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2149611Z %173 = tt.reshape %172 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:51:46.2149852Z %174 = arith.sitofp %173 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:51:46.2150192Z %175 = ttg.local_alloc %174 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:51:46.2150511Z %176 = ttg.local_load %175 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2150994Z %177 = tt.dot %149, %176, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2151359Z scf.yield %177 : tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2151521Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.2151744Z %131 = arith.truncf %130 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.2152041Z %132 = tt.expand_dims %117 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2152361Z %133 = arith.muli %132, %cst_11 : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2152591Z %134 = tt.expand_dims %112 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.2152848Z %135 = tt.broadcast %133 : tensor<64x1xi32, #mma> -> 
tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2153047Z %136 = tt.broadcast %134 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2153225Z %137 = arith.addi %135, %136 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2153415Z %138 = tt.addptr %21, %137 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2153606Z tt.store %138, %131 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.2153750Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:46.2153874Z scf.for %arg3 = %25 to %3 step %c1_i32 : i32 { 2026-02-21T09:51:46.2154007Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:51:46.2154131Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:51:46.2154264Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:51:46.2154383Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:51:46.2154501Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:51:46.2154619Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:51:46.2154732Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:51:46.2154841Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:51:46.2154969Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:51:46.2155126Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2155337Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.2155514Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:51:46.2155680Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2155887Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2156093Z %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.2156301Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.2156562Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> 
tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2156821Z %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.2157013Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2157185Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:51:46.2157432Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2157722Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2166081Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2166582Z %49 = tt.broadcast %48 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2166888Z %50 = arith.cmpi sge, %48, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2167128Z %51 = arith.cmpi slt, %48, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2167395Z %52 = arith.andi %50, %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2167689Z %53 = tt.broadcast %52 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2168037Z %54 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>) : i32 { 2026-02-21T09:51:46.2168275Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:51:46.2168535Z %64 = tt.splat %63 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2168753Z %65 = arith.addi %64, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.2169023Z %66 = 
tt.expand_dims %65 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.2169314Z %67 = tt.broadcast %66 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2169504Z %68 = arith.addi %44, %67 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2169702Z %69 = tt.addptr %9, %68 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.2169902Z %70 = tt.load %69 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.2170137Z %71 = ttg.local_alloc %70 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem> 2026-02-21T09:51:46.2170462Z %72 = ttg.local_load %71 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2170862Z %73 = arith.extf %72 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2171141Z %74 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:46.2171374Z %75 = tt.splat %74 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2171669Z %76 = arith.addi %75, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:51:46.2172060Z %77 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2172447Z %78 = arith.muli %77, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2172749Z %79 = tt.broadcast %78 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2173047Z %80 = arith.addi %79, %49 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:51:46.2173370Z %81 = tt.addptr %10, %80 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2173687Z %82 = arith.cmpi sge, %77, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2174014Z %83 = arith.cmpi slt, %77, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2174267Z %84 = arith.andi %82, %83 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2174558Z %85 = tt.broadcast %84 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2174877Z %86 = arith.andi %85, %53 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2175114Z %87 = tt.load %81, %86, %cst_1 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2175353Z %88 = arith.shli %87, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2175597Z %89 = arith.shrsi %88, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2175828Z %90 = arith.shrsi %87, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.2176108Z %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2176437Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.2176749Z %93 = tt.broadcast %91 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2176981Z %94 = arith.select %18, %93, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2177243Z %95 = tt.broadcast %92 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 
2026-02-21T09:51:46.2177502Z %96 = arith.select %20, %95, %94 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.2177801Z %97 = tt.reshape %96 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:51:46.2178026Z %98 = arith.sitofp %97 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:51:46.2178284Z %99 = ttg.local_alloc %98 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:51:46.2178629Z %100 = ttg.local_load %99 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.2179190Z %101 = tt.dot %73, %100, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2179573Z scf.yield %101 : tensor<64x64xf32, #mma> 2026-02-21T09:51:46.2179740Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.2179943Z %55 = arith.truncf %54 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.2180212Z %56 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2180447Z %57 = arith.muli %56, %cst_11 : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.2180686Z %58 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.2180937Z %59 = tt.broadcast %57 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2181130Z %60 = tt.broadcast %58 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2181308Z %61 = arith.addi %59, %60 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2181509Z %62 = tt.addptr %21, %61 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:46.2181694Z tt.store 
%62, %55 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.2181831Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:46.2181935Z tt.return 2026-02-21T09:51:46.2182016Z } 2026-02-21T09:51:46.2182095Z } 2026-02-21T09:51:46.2182141Z 2026-02-21T09:51:46.2182172Z {-# 2026-02-21T09:51:46.2182255Z external_resources: { 2026-02-21T09:51:46.2182357Z mlir_reproducer: { 2026-02-21T09:51:46.2183359Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:51:46.2184486Z disable_threading: false, 2026-02-21T09:51:46.2184805Z verify_each: true 2026-02-21T09:51:46.2184894Z } 2026-02-21T09:51:46.2185062Z } 2026-02-21T09:51:46.2185134Z #-} 2026-02-21T09:51:46.2185488Z /tmp/torchinductor_root/iz/cizbuh3veqzxvqwr47uaeiykj2usnn4s6fiecuivt63my44owufy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:51:46.2186172Z /tmp/torchinductor_root/iz/cizbuh3veqzxvqwr47uaeiykj2usnn4s6fiecuivt63my44owufy.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:51:46.2186734Z [437s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:51:46.2187507Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:51:46.2188263Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:51:46.2188430Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:51:46.7175299Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:51:46.7187220Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:51:46.7187765Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:46.7188374Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:46.7188936Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:46.7189565Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:51:46.7189978Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:51:46.7190335Z #smem = #ttg.shared_memory 2026-02-21T09:51:46.7190830Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:51:46.7191795Z 
tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:46.7192599Z %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7192924Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.7193295Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.7193572Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7193836Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:51:46.7194221Z %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7194867Z %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7195420Z %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7195792Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7196028Z %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7196333Z %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7196551Z %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7196843Z %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7197185Z %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7197474Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:46.7197650Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:51:46.7197873Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:46.7198101Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T09:51:46.7198322Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:51:46.7198529Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:51:46.7198795Z %cst_12 = 
arith.constant dense<0> : tensor<2x64xi8, #blocked2> 2026-02-21T09:51:46.7199123Z %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7199395Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:46.7199601Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:51:46.7199899Z %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7200151Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:46.7200305Z %1 = arith.muli %0, %c4_i32 : i32 2026-02-21T09:51:46.7200537Z %2 = arith.addi %1, %c4_i32 : i32 2026-02-21T09:51:46.7200678Z %3 = arith.minsi %2, %c32768_i32 : i32 2026-02-21T09:51:46.7200963Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7201480Z %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7201887Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7202232Z %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7202681Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7203025Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7203387Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7203837Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7204453Z %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7205216Z %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, 
parent = #blocked2}>> 2026-02-21T09:51:46.7205790Z %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:51:46.7206274Z %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:51:46.7206691Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.7206943Z %17 = arith.cmpi eq, %16, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.7207263Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:51:46.7207560Z %19 = arith.cmpi eq, %16, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:46.7207793Z %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:51:46.7208002Z %21 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.7208174Z %22 = arith.subi %3, %1 : i32 2026-02-21T09:51:46.7208356Z %23 = arith.remsi %22, %c3_i32 : i32 2026-02-21T09:51:46.7208529Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:51:46.7208696Z %25 = arith.addi %1, %24 : i32 2026-02-21T09:51:46.7208878Z scf.for %arg3 = %1 to %25 step %c3_i32 : i32 { 2026-02-21T09:51:46.7209089Z %26 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:51:46.7209282Z %27 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:51:46.7209465Z %28 = arith.muli %26, %c64_i32 : i32 2026-02-21T09:51:46.7209714Z %29 = tt.splat %28 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7210025Z %30 = arith.addi %29, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7210278Z %31 = arith.muli %27, %c64_i32 : i32 2026-02-21T09:51:46.7210532Z %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = 
#blocked1}>> 2026-02-21T09:51:46.7210856Z %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7211180Z %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7211496Z %35 = arith.addi %33, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7211885Z %36 = tt.expand_dims %34 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7212153Z %37 = arith.muli %36, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7212404Z %38 = tt.broadcast %37 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7212668Z %39 = arith.extsi %28 : i32 to i64 2026-02-21T09:51:46.7212914Z %40 = tt.splat %39 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7213255Z %41 = arith.addi %40, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7213681Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7214104Z %43 = tt.broadcast %42 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7214403Z %44 = arith.cmpi sge, %42, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7214666Z %45 = arith.cmpi slt, %42, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7214917Z %46 = arith.andi %44, %45 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:46.7215197Z %47 = tt.broadcast %46 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7215520Z %48 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7215934Z %49 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.7216366Z %50 = tt.broadcast %49 : tensor<1x4xi32, 
#blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7216653Z %51 = arith.addi %38, %50 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7216946Z %52 = tt.addptr %9, %51 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7217248Z %53 = tt.load %52 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7217677Z %54 = ttg.memdesc_index %48[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7218240Z ttg.local_store %53, %54 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7218514Z %55 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7218804Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.7219170Z %57 = tt.broadcast %56 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7219374Z %58 = arith.addi %38, %57 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7219564Z %59 = tt.addptr %9, %58 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7219763Z %60 = tt.load %59 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7220079Z %61 = ttg.memdesc_index %48[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7220498Z ttg.local_store %60, %61 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7221014Z %62:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %54, %arg8 = %61) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:46.7221429Z %289 = 
arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:46.7221557Z %290 = arith.muli %289, %c2_i32 : i32 2026-02-21T09:51:46.7221734Z %291 = tt.splat %290 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7221979Z %292 = arith.addi %291, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7222257Z %293 = tt.expand_dims %292 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.7222537Z %294 = tt.broadcast %293 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7222733Z %295 = arith.addi %38, %294 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7222935Z %296 = tt.addptr %9, %295 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7223140Z %297 = tt.load %296 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7223444Z %298 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7223991Z %299 = arith.extf %298 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7224274Z %300 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:46.7224449Z %301 = tt.splat %300 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7224674Z %302 = arith.addi %301, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7224953Z %303 = tt.expand_dims %302 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7225213Z %304 = arith.muli %303, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7225411Z %305 = tt.broadcast %304 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7225604Z %306 = arith.addi %305, 
%43 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7225802Z %307 = tt.addptr %10, %306 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7226015Z %308 = arith.cmpi sge, %303, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7226189Z %309 = arith.cmpi slt, %303, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7226373Z %310 = arith.andi %308, %309 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7226562Z %311 = tt.broadcast %310 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7226751Z %312 = arith.andi %311, %47 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7226921Z %313 = tt.load %307, %312, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7227178Z %314 = ttg.convert_layout %313 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7227490Z %315 = arith.shli %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7227728Z %316 = arith.shrsi %315, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7227966Z %317 = arith.shrsi %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7228254Z %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7228588Z %319 = tt.expand_dims %317 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7228872Z %320 = tt.broadcast %318 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7229116Z %321 = arith.select %18, %320, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7229361Z %322 = tt.broadcast %319 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7229596Z %323 = arith.select %20, %322, %321 : 
tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7229842Z %324 = tt.reshape %323 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7230064Z %325 = arith.sitofp %324 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7230362Z %326 = ttg.convert_layout %325 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7230833Z %327 = tt.dot %299, %326, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7231189Z %328 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:46.7231319Z %329 = arith.cmpi slt, %328, %c2_i32 : i32 2026-02-21T09:51:46.7231454Z %330 = arith.select %329, %328, %c0_i32 : i32 2026-02-21T09:51:46.7231718Z %331 = ttg.memdesc_index %48[%330] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7232069Z ttg.local_store %297, %331 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7232464Z scf.yield %327, %330, %arg8, %331 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7232799Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.7233128Z %63 = ttg.local_load %62#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7233551Z %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7233873Z %65 = arith.addi %12, %cst_4 : 
tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7234149Z %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7234407Z %67 = arith.muli %66, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7234593Z %68 = tt.broadcast %67 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7234784Z %69 = arith.addi %68, %43 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7234972Z %70 = tt.addptr %10, %69 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7235178Z %71 = arith.cmpi sge, %66, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7235358Z %72 = arith.cmpi slt, %66, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7235520Z %73 = arith.andi %71, %72 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7235704Z %74 = tt.broadcast %73 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7235886Z %75 = arith.andi %74, %47 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7236050Z %76 = tt.load %70, %75, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7236300Z %77 = ttg.convert_layout %76 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7236577Z %78 = arith.shli %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7236814Z %79 = arith.shrsi %78, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7237045Z %80 = arith.shrsi %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7237326Z %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7237650Z %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, 
#blocked> 2026-02-21T09:51:46.7237943Z %83 = tt.broadcast %81 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7238177Z %84 = arith.select %18, %83, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7238407Z %85 = tt.broadcast %82 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7238629Z %86 = arith.select %20, %85, %84 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7238848Z %87 = tt.reshape %86 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7239064Z %88 = arith.sitofp %87 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7239353Z %89 = ttg.convert_layout %88 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7239805Z %90 = tt.dot %64, %89, %62#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7240288Z %91 = ttg.local_load %62#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7240705Z %92 = arith.extf %91 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7241043Z %93 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7241321Z %94 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7241565Z %95 = arith.muli %94, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7241752Z %96 = tt.broadcast %95 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 
2026-02-21T09:51:46.7241940Z %97 = arith.addi %96, %43 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7242130Z %98 = tt.addptr %10, %97 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7242347Z %99 = arith.cmpi sge, %94, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7242517Z %100 = arith.cmpi slt, %94, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7242778Z %101 = arith.andi %99, %100 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7242964Z %102 = tt.broadcast %101 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7243152Z %103 = arith.andi %102, %47 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7243341Z %104 = tt.load %98, %103, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7243593Z %105 = ttg.convert_layout %104 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7243877Z %106 = arith.shli %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7244112Z %107 = arith.shrsi %106, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7244352Z %108 = arith.shrsi %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7244641Z %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7244974Z %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7245256Z %111 = tt.broadcast %109 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7245495Z %112 = arith.select %18, %111, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7245729Z %113 = tt.broadcast %110 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7245981Z 
%114 = arith.select %20, %113, %112 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7246208Z %115 = tt.reshape %114 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7246429Z %116 = arith.sitofp %115 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7246723Z %117 = ttg.convert_layout %116 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7247178Z %118 = tt.dot %92, %117, %90, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7247555Z ttg.local_dealloc %48 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7247763Z %119 = arith.truncf %118 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.7248027Z %120 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7248260Z %121 = arith.muli %120, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7248487Z %122 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.7248758Z %123 = tt.broadcast %121 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7248957Z %124 = tt.broadcast %122 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7249137Z %125 = arith.addi %123, %124 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7249327Z %126 = tt.addptr %21, %125 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7249520Z tt.store %126, %119 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.7249663Z %127 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:51:46.7249786Z %128 = arith.remsi %127, %c128_i32 : i32 2026-02-21T09:51:46.7249910Z %129 = 
arith.divsi %127, %c128_i32 : i32 2026-02-21T09:51:46.7250046Z %130 = arith.muli %128, %c64_i32 : i32 2026-02-21T09:51:46.7250209Z %131 = tt.splat %130 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7250420Z %132 = arith.addi %131, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7250586Z %133 = arith.muli %129, %c64_i32 : i32 2026-02-21T09:51:46.7250757Z %134 = tt.splat %133 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7250989Z %135 = tt.splat %133 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7251204Z %136 = arith.addi %134, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7251415Z %137 = arith.addi %135, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7251689Z %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7251948Z %139 = arith.muli %138, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7252141Z %140 = tt.broadcast %139 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7252320Z %141 = arith.extsi %130 : i32 to i64 2026-02-21T09:51:46.7252486Z %142 = tt.splat %141 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7252710Z %143 = arith.addi %142, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7252991Z %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7253270Z %145 = tt.broadcast %144 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7253491Z %146 = arith.cmpi sge, %144, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7253666Z %147 = arith.cmpi slt, %144, %cst_10 : tensor<1x64xi64, #blocked2> 
2026-02-21T09:51:46.7253838Z %148 = arith.andi %146, %147 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:46.7254028Z %149 = tt.broadcast %148 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7254244Z %150 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7254432Z %151 = arith.addi %140, %50 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7254630Z %152 = tt.addptr %9, %151 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7254836Z %153 = tt.load %152 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7255116Z %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7255477Z ttg.local_store %153, %154 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7255719Z %155 = arith.addi %140, %57 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7255915Z %156 = tt.addptr %9, %155 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7256121Z %157 = tt.load %156 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7256416Z %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7256770Z ttg.local_store %157, %158 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7257291Z %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:46.7257704Z %289 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:46.7257858Z %290 = arith.muli %289, %c2_i32 : i32 
2026-02-21T09:51:46.7258032Z %291 = tt.splat %290 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7258257Z %292 = arith.addi %291, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7258538Z %293 = tt.expand_dims %292 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.7258831Z %294 = tt.broadcast %293 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7259029Z %295 = arith.addi %140, %294 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7259236Z %296 = tt.addptr %9, %295 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7259444Z %297 = tt.load %296 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7259740Z %298 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7260174Z %299 = arith.extf %298 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7260457Z %300 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:46.7260633Z %301 = tt.splat %300 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7260855Z %302 = arith.addi %301, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7261132Z %303 = tt.expand_dims %302 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7261400Z %304 = arith.muli %303, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7261596Z %305 = tt.broadcast %304 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7261793Z %306 = arith.addi %305, %145 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7261989Z %307 = tt.addptr %10, %306 : 
tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7262202Z %308 = arith.cmpi sge, %303, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7262375Z %309 = arith.cmpi slt, %303, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7262540Z %310 = arith.andi %308, %309 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7262730Z %311 = tt.broadcast %310 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7262922Z %312 = arith.andi %311, %149 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7263093Z %313 = tt.load %307, %312, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7263352Z %314 = ttg.convert_layout %313 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7263635Z %315 = arith.shli %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7263875Z %316 = arith.shrsi %315, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7264110Z %317 = arith.shrsi %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7264417Z %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7264752Z %319 = tt.expand_dims %317 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7265035Z %320 = tt.broadcast %318 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7265281Z %321 = arith.select %18, %320, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7265535Z %322 = tt.broadcast %319 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7265767Z %323 = arith.select %20, %322, %321 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7265996Z %324 = 
tt.reshape %323 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7266220Z %325 = arith.sitofp %324 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7266540Z %326 = ttg.convert_layout %325 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7267004Z %327 = tt.dot %299, %326, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7267354Z %328 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:46.7267482Z %329 = arith.cmpi slt, %328, %c2_i32 : i32 2026-02-21T09:51:46.7267620Z %330 = arith.select %329, %328, %c0_i32 : i32 2026-02-21T09:51:46.7267889Z %331 = ttg.memdesc_index %150[%330] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7268243Z ttg.local_store %297, %331 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7268633Z scf.yield %327, %330, %arg8, %331 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7268964Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.7269293Z %160 = ttg.local_load %159#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7269754Z %161 = arith.extf %160 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7270052Z %162 = arith.addi %68, %145 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7270254Z %163 = tt.addptr %10, %162 : 
tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7270459Z %164 = arith.andi %74, %149 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7270627Z %165 = tt.load %163, %164, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7270886Z %166 = ttg.convert_layout %165 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7271166Z %167 = arith.shli %166, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7271405Z %168 = arith.shrsi %167, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7271647Z %169 = arith.shrsi %166, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7271935Z %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7272270Z %171 = tt.expand_dims %169 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7272562Z %172 = tt.broadcast %170 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7272804Z %173 = arith.select %18, %172, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7273042Z %174 = tt.broadcast %171 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7273276Z %175 = arith.select %20, %174, %173 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7273510Z %176 = tt.reshape %175 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7273751Z %177 = arith.sitofp %176 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7274050Z %178 = ttg.convert_layout %177 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7274533Z %179 = 
tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7275027Z %180 = ttg.local_load %159#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7275462Z %181 = arith.extf %180 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7275766Z %182 = arith.addi %96, %145 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7275966Z %183 = tt.addptr %10, %182 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7276172Z %184 = arith.andi %102, %149 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7276343Z %185 = tt.load %183, %184, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7276607Z %186 = ttg.convert_layout %185 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7276898Z %187 = arith.shli %186, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7277138Z %188 = arith.shrsi %187, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7277398Z %189 = arith.shrsi %186, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7277687Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7278027Z %191 = tt.expand_dims %189 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7278316Z %192 = tt.broadcast %190 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7278557Z %193 = 
arith.select %18, %192, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7278801Z %194 = tt.broadcast %191 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7279034Z %195 = arith.select %20, %194, %193 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7279266Z %196 = tt.reshape %195 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7279493Z %197 = arith.sitofp %196 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7279791Z %198 = ttg.convert_layout %197 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7280259Z %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7280659Z ttg.local_dealloc %150 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7280874Z %200 = arith.truncf %199 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.7281144Z %201 = tt.expand_dims %137 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7281386Z %202 = arith.muli %201, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7281619Z %203 = tt.expand_dims %132 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.7281897Z %204 = tt.broadcast %202 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7282100Z %205 = tt.broadcast %203 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7282285Z %206 = arith.addi %204, %205 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7282475Z %207 = tt.addptr %21, %206 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 
2026-02-21T09:51:46.7282725Z tt.store %207, %200 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.7282892Z %208 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:51:46.7283024Z %209 = arith.remsi %208, %c128_i32 : i32 2026-02-21T09:51:46.7283153Z %210 = arith.divsi %208, %c128_i32 : i32 2026-02-21T09:51:46.7283275Z %211 = arith.muli %209, %c64_i32 : i32 2026-02-21T09:51:46.7283444Z %212 = tt.splat %211 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7283657Z %213 = arith.addi %212, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7283830Z %214 = arith.muli %210, %c64_i32 : i32 2026-02-21T09:51:46.7284002Z %215 = tt.splat %214 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7284222Z %216 = tt.splat %214 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7284442Z %217 = arith.addi %215, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7284657Z %218 = arith.addi %216, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7284933Z %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7285187Z %220 = arith.muli %219, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7285405Z %221 = tt.broadcast %220 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7285585Z %222 = arith.extsi %211 : i32 to i64 2026-02-21T09:51:46.7285755Z %223 = tt.splat %222 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7285985Z %224 = arith.addi %223, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7286265Z %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7286549Z %226 = tt.broadcast %225 : 
tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7286758Z %227 = arith.cmpi sge, %225, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7286934Z %228 = arith.cmpi slt, %225, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7287108Z %229 = arith.andi %227, %228 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:46.7287299Z %230 = tt.broadcast %229 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7287521Z %231 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7287710Z %232 = arith.addi %221, %50 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7287915Z %233 = tt.addptr %9, %232 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7288150Z %234 = tt.load %233 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7288436Z %235 = ttg.memdesc_index %231[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7288799Z ttg.local_store %234, %235 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7289039Z %236 = arith.addi %221, %57 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7289241Z %237 = tt.addptr %9, %236 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7289449Z %238 = tt.load %237 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7289758Z %239 = ttg.memdesc_index %231[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7290119Z ttg.local_store %238, %239 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7290658Z %240:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %235, %arg8 = %239) -> (tensor<64x64xf32, #mma>, i32, 
!ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:46.7291074Z %289 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:46.7291206Z %290 = arith.muli %289, %c2_i32 : i32 2026-02-21T09:51:46.7291383Z %291 = tt.splat %290 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7291614Z %292 = arith.addi %291, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7291902Z %293 = tt.expand_dims %292 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.7292183Z %294 = tt.broadcast %293 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7292385Z %295 = arith.addi %221, %294 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7292590Z %296 = tt.addptr %9, %295 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7292804Z %297 = tt.load %296 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7293110Z %298 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7293564Z %299 = arith.extf %298 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7293854Z %300 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:46.7294031Z %301 = tt.splat %300 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7294263Z %302 = arith.addi %301, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7294548Z %303 = tt.expand_dims %302 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7294803Z %304 = arith.muli %303, %cst_6 : tensor<2x1xi64, #blocked2> 
2026-02-21T09:51:46.7295004Z %305 = tt.broadcast %304 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7295204Z %306 = arith.addi %305, %226 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7295412Z %307 = tt.addptr %10, %306 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7295630Z %308 = arith.cmpi sge, %303, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7295808Z %309 = arith.cmpi slt, %303, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7295982Z %310 = arith.andi %308, %309 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7296189Z %311 = tt.broadcast %310 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7296386Z %312 = arith.andi %311, %230 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7296558Z %313 = tt.load %307, %312, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7296822Z %314 = ttg.convert_layout %313 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7297112Z %315 = arith.shli %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7297354Z %316 = arith.shrsi %315, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7297615Z %317 = arith.shrsi %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7297906Z %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7298249Z %319 = tt.expand_dims %317 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7298553Z %320 = tt.broadcast %318 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7298794Z %321 = arith.select %18, %320, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 
2026-02-21T09:51:46.7299040Z %322 = tt.broadcast %319 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7299275Z %323 = arith.select %20, %322, %321 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7299511Z %324 = tt.reshape %323 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7299741Z %325 = arith.sitofp %324 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7300040Z %326 = ttg.convert_layout %325 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7300514Z %327 = tt.dot %299, %326, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7300868Z %328 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:46.7301000Z %329 = arith.cmpi slt, %328, %c2_i32 : i32 2026-02-21T09:51:46.7301140Z %330 = arith.select %329, %328, %c0_i32 : i32 2026-02-21T09:51:46.7301426Z %331 = ttg.memdesc_index %231[%330] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7301786Z ttg.local_store %297, %331 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7302180Z scf.yield %327, %330, %arg8, %331 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7302514Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.7302839Z %241 = ttg.local_load %240#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7303271Z %242 = arith.extf %241 : tensor<64x4xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7303577Z %243 = arith.addi %68, %226 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7303785Z %244 = tt.addptr %10, %243 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7303988Z %245 = arith.andi %74, %230 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7304165Z %246 = tt.load %244, %245, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7304436Z %247 = ttg.convert_layout %246 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7304726Z %248 = arith.shli %247, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7304970Z %249 = arith.shrsi %248, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7305208Z %250 = arith.shrsi %247, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7305501Z %251 = tt.expand_dims %249 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7305852Z %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7306140Z %253 = tt.broadcast %251 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7306387Z %254 = arith.select %18, %253, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7306626Z %255 = tt.broadcast %252 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7306881Z %256 = arith.select %20, %255, %254 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7307111Z %257 = tt.reshape %256 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7307336Z %258 = 
arith.sitofp %257 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7307634Z %259 = ttg.convert_layout %258 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7308102Z %260 = tt.dot %242, %259, %240#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7308598Z %261 = ttg.local_load %240#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7309038Z %262 = arith.extf %261 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7309337Z %263 = arith.addi %96, %226 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7309561Z %264 = tt.addptr %10, %263 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7309762Z %265 = arith.andi %102, %230 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7309938Z %266 = tt.load %264, %265, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7310201Z %267 = ttg.convert_layout %266 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7310484Z %268 = arith.shli %267, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7310727Z %269 = arith.shrsi %268, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7310965Z %270 = arith.shrsi %267, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7311260Z %271 = tt.expand_dims %269 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7311608Z %272 = 
tt.expand_dims %270 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7311891Z %273 = tt.broadcast %271 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7312134Z %274 = arith.select %18, %273, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7312371Z %275 = tt.broadcast %272 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7312621Z %276 = arith.select %20, %275, %274 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7312855Z %277 = tt.reshape %276 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7313078Z %278 = arith.sitofp %277 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7313379Z %279 = ttg.convert_layout %278 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7313849Z %280 = tt.dot %262, %279, %260, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7314246Z ttg.local_dealloc %231 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7314457Z %281 = arith.truncf %280 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.7314719Z %282 = tt.expand_dims %218 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7314977Z %283 = arith.muli %282, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7315205Z %284 = tt.expand_dims %213 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.7315462Z %285 = tt.broadcast %283 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7315664Z %286 = 
tt.broadcast %284 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7315842Z %287 = arith.addi %285, %286 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7316032Z %288 = tt.addptr %21, %287 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7316227Z tt.store %288, %281 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.7316366Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:46.7316489Z scf.for %arg3 = %25 to %3 step %c1_i32 : i32 { 2026-02-21T09:51:46.7316623Z %26 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:51:46.7316748Z %27 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:51:46.7316869Z %28 = arith.muli %26, %c64_i32 : i32 2026-02-21T09:51:46.7317031Z %29 = tt.splat %28 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7317233Z %30 = arith.addi %29, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:46.7317422Z %31 = arith.muli %27, %c64_i32 : i32 2026-02-21T09:51:46.7317587Z %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7317799Z %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7318010Z %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:46.7318214Z %35 = arith.addi %33, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:46.7318482Z %36 = tt.expand_dims %34 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7318733Z %37 = arith.muli %36, %cst_11 : tensor<64x1xi32, #blocked1> 2026-02-21T09:51:46.7318922Z %38 = tt.broadcast %37 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7319095Z %39 = arith.extsi %28 : i32 to i64 2026-02-21T09:51:46.7319258Z %40 = tt.splat %39 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7319475Z %41 = 
arith.addi %40, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:46.7319749Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7320020Z %43 = tt.broadcast %42 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7320234Z %44 = arith.cmpi sge, %42, %cst_9 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7320403Z %45 = arith.cmpi slt, %42, %cst_10 : tensor<1x64xi64, #blocked2> 2026-02-21T09:51:46.7320567Z %46 = arith.andi %44, %45 : tensor<1x64xi1, #blocked2> 2026-02-21T09:51:46.7320744Z %47 = tt.broadcast %46 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7320958Z %48 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7321227Z %49 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.7321509Z %50 = tt.broadcast %49 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7321696Z %51 = arith.addi %38, %50 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7321887Z %52 = tt.addptr %9, %51 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7322089Z %53 = tt.load %52 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7322386Z %54 = ttg.memdesc_index %48[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7322779Z ttg.local_store %53, %54 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7323052Z %55 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7323324Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 
2026-02-21T09:51:46.7323595Z %57 = tt.broadcast %56 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7323782Z %58 = arith.addi %38, %57 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7323972Z %59 = tt.addptr %9, %58 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7324172Z %60 = tt.load %59 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7324446Z %61 = ttg.memdesc_index %48[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7324795Z ttg.local_store %60, %61 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7325324Z %62:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %54, %arg8 = %61) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>) : i32 { 2026-02-21T09:51:46.7325740Z %127 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:51:46.7325867Z %128 = arith.muli %127, %c2_i32 : i32 2026-02-21T09:51:46.7326042Z %129 = tt.splat %128 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7326267Z %130 = arith.addi %129, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:46.7326547Z %131 = tt.expand_dims %130 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:46.7326824Z %132 = tt.broadcast %131 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7327020Z %133 = arith.addi %38, %132 : tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7327223Z %134 = tt.addptr %9, %133 : tensor<64x4x!tt.ptr, #blocked1>, tensor<64x4xi32, #blocked1> 2026-02-21T09:51:46.7327429Z %135 = tt.load %134 : tensor<64x4x!tt.ptr, #blocked1> 2026-02-21T09:51:46.7327728Z %136 = 
ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7328178Z %137 = arith.extf %136 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7328465Z %138 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:51:46.7328641Z %139 = tt.splat %138 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7328865Z %140 = arith.addi %139, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7329145Z %141 = tt.expand_dims %140 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7329393Z %142 = arith.muli %141, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7329607Z %143 = tt.broadcast %142 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7329800Z %144 = arith.addi %143, %43 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7329995Z %145 = tt.addptr %10, %144 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7330207Z %146 = arith.cmpi sge, %141, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7330397Z %147 = arith.cmpi slt, %141, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7330563Z %148 = arith.andi %146, %147 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7330750Z %149 = tt.broadcast %148 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7330943Z %150 = arith.andi %149, %47 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7331114Z %151 = tt.load %145, %150, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7331374Z %152 = ttg.convert_layout %151 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7331658Z %153 = arith.shli %152, 
%cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7331898Z %154 = arith.shrsi %153, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7332140Z %155 = arith.shrsi %152, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7332435Z %156 = tt.expand_dims %154 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7332771Z %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7333072Z %158 = tt.broadcast %156 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7333310Z %159 = arith.select %18, %158, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7333548Z %160 = tt.broadcast %157 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7333782Z %161 = arith.select %20, %160, %159 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7334011Z %162 = tt.reshape %161 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7334234Z %163 = arith.sitofp %162 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7334527Z %164 = ttg.convert_layout %163 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7334997Z %165 = tt.dot %137, %164, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7335347Z %166 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:51:46.7335474Z %167 = arith.cmpi slt, %166, %c2_i32 : i32 2026-02-21T09:51:46.7335608Z %168 = arith.select %167, %166, %c0_i32 : i32 
2026-02-21T09:51:46.7335869Z %169 = ttg.memdesc_index %48[%168] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7336240Z ttg.local_store %135, %169 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7336629Z scf.yield %165, %168, %arg8, %169 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> 2026-02-21T09:51:46.7336964Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:51:46.7337276Z %63 = ttg.local_load %62#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7337716Z %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7338042Z %65 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7338336Z %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7338581Z %67 = arith.muli %66, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7338767Z %68 = tt.broadcast %67 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7338956Z %69 = arith.addi %68, %43 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7339145Z %70 = tt.addptr %10, %69 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7339352Z %71 = arith.cmpi sge, %66, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7339517Z %72 = arith.cmpi slt, %66, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7339677Z %73 = arith.andi %71, %72 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7339858Z %74 
= tt.broadcast %73 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7340040Z %75 = arith.andi %74, %47 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7340205Z %76 = tt.load %70, %75, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7340452Z %77 = ttg.convert_layout %76 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7340728Z %78 = arith.shli %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7340974Z %79 = arith.shrsi %78, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7341207Z %80 = arith.shrsi %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7341490Z %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7341816Z %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7342090Z %83 = tt.broadcast %81 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7342321Z %84 = arith.select %18, %83, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7342548Z %85 = tt.broadcast %82 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7342770Z %86 = arith.select %20, %85, %84 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7342985Z %87 = tt.reshape %86 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7343199Z %88 = arith.sitofp %87 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7343483Z %89 = ttg.convert_layout %88 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7343952Z %90 = tt.dot %64, %89, 
%62#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7344443Z %91 = ttg.local_load %62#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7344868Z %92 = arith.extf %91 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7345195Z %93 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:46.7345492Z %94 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7345733Z %95 = arith.muli %94, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7345920Z %96 = tt.broadcast %95 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7346106Z %97 = arith.addi %96, %43 : tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7346314Z %98 = tt.addptr %10, %97 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T09:51:46.7346518Z %99 = arith.cmpi sge, %94, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7346688Z %100 = arith.cmpi slt, %94, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:46.7346853Z %101 = arith.andi %99, %100 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:46.7347034Z %102 = tt.broadcast %101 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7347228Z %103 = arith.andi %102, %47 : tensor<2x64xi1, #blocked2> 2026-02-21T09:51:46.7347397Z %104 = tt.load %98, %103, %cst_12 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T09:51:46.7347648Z %105 = ttg.convert_layout %104 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:51:46.7347932Z %106 = arith.shli %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7348168Z %107 = arith.shrsi %106, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7348410Z %108 = arith.shrsi %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:46.7348720Z %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7349054Z %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:51:46.7349335Z %111 = tt.broadcast %109 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7349571Z %112 = arith.select %18, %111, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7349810Z %113 = tt.broadcast %110 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7350043Z %114 = arith.select %20, %113, %112 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:51:46.7350269Z %115 = tt.reshape %114 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3> 2026-02-21T09:51:46.7350487Z %116 = arith.sitofp %115 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3> 2026-02-21T09:51:46.7350779Z %117 = ttg.convert_layout %116 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:46.7351242Z %118 = tt.dot %92, %117, %90, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:51:46.7351620Z ttg.local_dealloc %48 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> 2026-02-21T09:51:46.7351843Z %119 = arith.truncf %118 : tensor<64x64xf32, #mma> 
to tensor<64x64xbf16, #mma> 2026-02-21T09:51:46.7352106Z %120 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7352340Z %121 = arith.muli %120, %cst : tensor<64x1xi32, #mma> 2026-02-21T09:51:46.7352567Z %122 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:51:46.7352823Z %123 = tt.broadcast %121 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7353039Z %124 = tt.broadcast %122 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7353218Z %125 = arith.addi %123, %124 : tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7353404Z %126 = tt.addptr %21, %125 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:51:46.7353597Z tt.store %126, %119 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:51:46.7353735Z } {tt.num_stages = 1 : i32} 2026-02-21T09:51:46.7353836Z tt.return 2026-02-21T09:51:46.7353918Z } 2026-02-21T09:51:46.7353993Z } 2026-02-21T09:51:46.7354038Z 2026-02-21T09:51:46.7354085Z {-# 2026-02-21T09:51:46.7354164Z external_resources: { 2026-02-21T09:51:46.7354264Z mlir_reproducer: { 2026-02-21T09:51:46.7355265Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:51:46.7356264Z disable_threading: false, 2026-02-21T09:51:46.7356372Z verify_each: true 2026-02-21T09:51:46.7356460Z } 
2026-02-21T09:51:46.7356531Z }
2026-02-21T09:51:46.7356602Z #-}
2026-02-21T09:51:46.7356878Z /tmp/torchinductor_root/2y/c2y7gy7ipehhdbfa3y5ziol7zm3dkrshalpbtxjnxv6uzlhj6n2z.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:46.7357576Z /tmp/torchinductor_root/2y/c2y7gy7ipehhdbfa3y5ziol7zm3dkrshalpbtxjnxv6uzlhj6n2z.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:46.7358120Z [437s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:46.7358900Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:51:46.7359598Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:46.7359764Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:47.9714358Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:47.9716713Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}> 2026-02-21T09:51:47.9717216Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:51:47.9718025Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}> 2026-02-21T09:51:47.9718473Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:51:47.9718857Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:47.9719216Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:47.9719486Z #smem = #ttg.shared_memory 2026-02-21T09:51:47.9726942Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:51:47.9727592Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:47.9727976Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:51:47.9728157Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:47.9728391Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:47.9728567Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:47.9728751Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:51:47.9728939Z %cst_4 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:47.9729111Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:47.9729285Z %cst_6 = arith.constant dense<512> : 
tensor<2x1xi64, #blocked2> 2026-02-21T09:51:47.9729457Z %cst_7 = arith.constant dense<0> : tensor<1x256xi64, #blocked2> 2026-02-21T09:51:47.9729629Z %cst_8 = arith.constant dense<8192> : tensor<1x256xi64, #blocked2> 2026-02-21T09:51:47.9729803Z %cst_9 = arith.constant dense<0> : tensor<2x256xi8, #blocked2> 2026-02-21T09:51:47.9729949Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:47.9730068Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:51:47.9730183Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:47.9730300Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:51:47.9730443Z %cst_10 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T09:51:47.9730587Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:51:47.9730704Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:47.9730980Z %cst_11 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:47.9731177Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:47.9731287Z %1 = arith.divsi %0, %c256_i32 : i32 2026-02-21T09:51:47.9731401Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:51:47.9731514Z %3 = arith.subi %c32_i32, %2 : i32 2026-02-21T09:51:47.9731622Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:51:47.9731735Z %5 = arith.remsi %0, %c256_i32 : i32 2026-02-21T09:51:47.9731846Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:51:47.9731955Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:51:47.9732058Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:51:47.9732170Z %9 = arith.muli %7, %c256_i32 : i32 2026-02-21T09:51:47.9732361Z %10 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:47.9732642Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:47.9732892Z %12 = tt.splat %9 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:47.9733099Z %13 = arith.addi %12, %10 : 
tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:47.9733263Z %14 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:51:47.9733458Z %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:47.9733748Z %16 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:47.9733992Z %17 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:47.9734198Z %18 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:47.9734408Z %19 = arith.addi %17, %15 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:47.9734613Z %20 = arith.addi %18, %16 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:47.9734849Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:47.9735178Z %22 = tt.expand_dims %19 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:51:47.9735428Z %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked1> 2026-02-21T09:51:47.9735623Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:47.9735853Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:47.9736016Z %26 = arith.extsi %9 : i32 to i64 2026-02-21T09:51:47.9736165Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked2> 2026-02-21T09:51:47.9736401Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:47.9736719Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:47.9737010Z %30 = tt.splat %26 : i64 -> 
tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:47.9737303Z %31 = arith.extsi %11 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:47.9737598Z %32 = arith.addi %30, %31 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:47.9737870Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2> 2026-02-21T09:51:47.9738145Z %34 = tt.broadcast %33 : tensor<1x256xi64, #blocked2> -> tensor<2x256xi64, #blocked2> 2026-02-21T09:51:47.9738343Z %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x256xi64, #blocked2> 2026-02-21T09:51:47.9738530Z %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x256xi64, #blocked2> 2026-02-21T09:51:47.9738691Z %37 = arith.andi %35, %36 : tensor<1x256xi1, #blocked2> 2026-02-21T09:51:47.9738877Z %38 = tt.broadcast %37 : tensor<1x256xi1, #blocked2> -> tensor<2x256xi1, #blocked2> 2026-02-21T09:51:47.9739161Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:51:47.9739572Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:51:47.9739975Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:47.9740227Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:47.9740418Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:51:47.9740627Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:51:47.9740814Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 
2026-02-21T09:51:47.9741074Z %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst_3) -> (tensor<128x256xf32, #mma>) : i32 { 2026-02-21T09:51:47.9741291Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:51:47.9741475Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:47.9741689Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:51:47.9741958Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:51:47.9742224Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:51:47.9742414Z %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1> 2026-02-21T09:51:47.9742609Z %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:51:47.9742832Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:51:47.9743051Z %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:51:47.9743377Z %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:47.9743800Z %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:47.9744077Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:51:47.9744245Z %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:47.9744460Z %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:47.9744729Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:51:47.9744973Z %71 = 
arith.muli %70, %cst_4 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:47.9745157Z %72 = tt.broadcast %71 : tensor<2x1xi64, #blocked2> -> tensor<2x256xi64, #blocked2> 2026-02-21T09:51:47.9745347Z %73 = arith.addi %72, %34 : tensor<2x256xi64, #blocked2> 2026-02-21T09:51:47.9745541Z %74 = tt.addptr %27, %73 : tensor<2x256x!tt.ptr, #blocked2>, tensor<2x256xi64, #blocked2> 2026-02-21T09:51:47.9745746Z %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:47.9745913Z %76 = arith.cmpi slt, %70, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:51:47.9746069Z %77 = arith.andi %75, %76 : tensor<2x1xi1, #blocked2> 2026-02-21T09:51:47.9746272Z %78 = tt.broadcast %77 : tensor<2x1xi1, #blocked2> -> tensor<2x256xi1, #blocked2> 2026-02-21T09:51:47.9746455Z %79 = arith.andi %78, %38 : tensor<2x256xi1, #blocked2> 2026-02-21T09:51:47.9746622Z %80 = tt.load %74, %79, %cst_9 : tensor<2x256x!tt.ptr, #blocked2> 2026-02-21T09:51:47.9746873Z %81 = ttg.convert_layout %80 : tensor<2x256xi8, #blocked2> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:47.9747154Z %82 = arith.shli %81, %cst_11 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:47.9747388Z %83 = arith.shrsi %82, %cst_11 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:47.9747620Z %84 = arith.shrsi %81, %cst_11 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:47.9747904Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:51:47.9748240Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:51:47.9748521Z %87 = tt.broadcast %85 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:51:47.9748762Z %88 = arith.select %43, %87, %cst_10 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, 
#blocked> 2026-02-21T09:51:47.9748995Z %89 = tt.broadcast %86 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:51:47.9749242Z %90 = arith.select %45, %89, %88 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:51:47.9749470Z %91 = tt.reshape %90 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked2> 2026-02-21T09:51:47.9749686Z %92 = arith.sitofp %91 : tensor<4x256xi8, #blocked2> to tensor<4x256xf32, #blocked2> 2026-02-21T09:51:47.9749934Z %93 = ttg.local_alloc %92 : (tensor<4x256xf32, #blocked2>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:51:47.9750253Z %94 = ttg.local_load %93 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:47.9750745Z %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:51:47.9751097Z scf.yield %95 : tensor<128x256xf32, #mma> 2026-02-21T09:51:47.9751245Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32} 2026-02-21T09:51:47.9751433Z %47 = arith.truncf %46 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:51:47.9751715Z %48 = tt.expand_dims %20 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:51:47.9751946Z %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:51:47.9752171Z %50 = tt.expand_dims %13 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:51:47.9752425Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:51:47.9752629Z %52 = tt.broadcast %50 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:51:47.9752806Z %53 = arith.addi %51, %52 : tensor<128x256xi32, #mma> 
2026-02-21T09:51:47.9752979Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:51:47.9753193Z %55 = tt.addptr %54, %53 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:51:47.9753385Z tt.store %55, %47 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:51:47.9753517Z tt.return 2026-02-21T09:51:47.9753595Z } 2026-02-21T09:51:47.9753673Z } 2026-02-21T09:51:47.9753717Z 2026-02-21T09:51:47.9753748Z {-# 2026-02-21T09:51:47.9753831Z external_resources: { 2026-02-21T09:51:47.9753928Z mlir_reproducer: { 2026-02-21T09:51:47.9754958Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:51:47.9755943Z disable_threading: false, 2026-02-21T09:51:47.9756052Z verify_each: true 2026-02-21T09:51:47.9756141Z } 2026-02-21T09:51:47.9756214Z } 2026-02-21T09:51:47.9756282Z #-} 2026-02-21T09:51:47.9756570Z /tmp/torchinductor_root/su/csukzv5zbzywkh36jczfxtklbfcnhvaid6ak4lorzwsrnck6yo75.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:51:47.9757250Z /tmp/torchinductor_root/su/csukzv5zbzywkh36jczfxtklbfcnhvaid6ak4lorzwsrnck6yo75.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:51:47.9757800Z [438s] Triton compile failed. This likely indicates a bug in Triton. 
Skipping failing config. 2026-02-21T09:51:47.9758543Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:51:47.9759202Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:51:47.9759368Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:51:48.5776220Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:51:48.5779776Z #blocked = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:48.5780702Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:51:48.5781646Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:48.5782456Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:51:48.5783213Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:51:48.5783905Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:51:48.5784543Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:51:48.5784985Z #smem = #ttg.shared_memory 2026-02-21T09:51:48.5785418Z module attributes {"ttg.num-ctas" = 1 : i32, 
"ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:51:48.5786299Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:51:48.5787005Z %cst = arith.constant dense<4> : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5787288Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:51:48.5787499Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:51:48.5787710Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:51:48.5788033Z %cst_0 = arith.constant dense<0> : tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5788362Z %cst_1 = arith.constant dense<0> : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5788627Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:51:48.5788839Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:51:48.5789051Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:51:48.5789322Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:51:48.5789661Z %cst_3 = arith.constant dense<8192> : tensor<1x256xi64, #blocked> 2026-02-21T09:51:48.5789990Z %cst_4 = arith.constant dense<0> : tensor<1x256xi64, #blocked> 2026-02-21T09:51:48.5790308Z %cst_5 = arith.constant dense<512> : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5790617Z %cst_6 = arith.constant dense<0> : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5790930Z %cst_7 = arith.constant dense<8192> : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5791322Z %cst_8 = arith.constant dense<4> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:48.5791778Z %cst_9 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:48.5792125Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:51:48.5792336Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:51:48.5792557Z %c1_i32 = arith.constant 1 : i32 
2026-02-21T09:51:48.5792836Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:51:48.5793217Z %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:48.5793537Z %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:48.5793861Z %cst_13 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:51:48.5794131Z %0 = tt.get_program_id x : i32 2026-02-21T09:51:48.5794338Z %1 = arith.divsi %0, %c64_i32 : i32 2026-02-21T09:51:48.5794550Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:51:48.5794757Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:51:48.5794961Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:51:48.5795113Z %5 = arith.remsi %0, %c64_i32 : i32 2026-02-21T09:51:48.5795289Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:51:48.5795439Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:51:48.5795585Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:51:48.5795730Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:51:48.5796001Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:48.5796390Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:48.5796758Z %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:48.5797055Z %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:48.5797342Z %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:51:48.5797636Z %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:51:48.5797863Z %16 = arith.muli %8, %c256_i32 : i32 2026-02-21T09:51:48.5798135Z %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:48.5798502Z %18 = 
tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:48.5798831Z %19 = tt.splat %16 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:48.5799117Z %20 = arith.addi %19, %18 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:51:48.5799437Z %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:48.5799883Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:51:48.5800232Z %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:51:48.5800496Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5800797Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:51:48.5801011Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T09:51:48.5801219Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked> 2026-02-21T09:51:48.5801530Z %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:48.5801974Z %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:48.5802357Z %30 = tt.splat %26 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:48.5802856Z %31 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:48.5803256Z %32 = arith.addi %30, %31 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:51:48.5803623Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:51:48.5803995Z %34 = 
tt.broadcast %33 : tensor<1x256xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5804316Z %35 = arith.cmpi sge, %33, %cst_4 : tensor<1x256xi64, #blocked> 2026-02-21T09:51:48.5804546Z %36 = arith.cmpi slt, %33, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:51:48.5804760Z %37 = arith.andi %35, %36 : tensor<1x256xi1, #blocked> 2026-02-21T09:51:48.5805029Z %38 = tt.broadcast %37 : tensor<1x256xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T09:51:48.5805337Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:51:48.5805791Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:51:48.5806248Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:48.5806527Z %42 = arith.cmpi eq, %41, %cst_11 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:48.5806742Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x256xi1, #blocked1> 2026-02-21T09:51:48.5806977Z %44 = arith.cmpi eq, %41, %cst_12 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:51:48.5807187Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x256xi1, #blocked1> 2026-02-21T09:51:48.5807423Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:51:48.5807715Z %47 = tt.expand_dims %21 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:51:48.5808012Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5808215Z %49 = arith.addi %24, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5808426Z %50 = tt.addptr %25, %49 : tensor<128x8x!tt.ptr, #blocked2>, 
tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5808646Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:51:48.5808899Z %52 = tt.expand_dims %29 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5809157Z %53 = arith.muli %52, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5809350Z %54 = tt.broadcast %53 : tensor<4x1xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5809569Z %55 = arith.addi %54, %34 : tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5809776Z %56 = tt.addptr %27, %55 : tensor<4x256x!tt.ptr, #blocked>, tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5809990Z %57 = arith.cmpi sge, %52, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5810171Z %58 = arith.cmpi slt, %52, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5810336Z %59 = arith.andi %57, %58 : tensor<4x1xi1, #blocked> 2026-02-21T09:51:48.5810528Z %60 = tt.broadcast %59 : tensor<4x1xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T09:51:48.5810721Z %61 = arith.andi %60, %38 : tensor<4x256xi1, #blocked> 2026-02-21T09:51:48.5810945Z %62 = tt.load %56, %61, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x256x!tt.ptr, #blocked> 2026-02-21T09:51:48.5811310Z %63 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:51:48.5811703Z ttg.local_store %51, %63 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:51:48.5811998Z %64 = arith.addi %21, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:48.5812303Z %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:51:48.5812598Z %66 = tt.broadcast %65 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 
2026-02-21T09:51:48.5812819Z %67 = arith.addi %24, %66 : tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5813029Z %68 = tt.addptr %25, %67 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5813249Z %69 = tt.load %68 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:51:48.5813451Z %70 = arith.addi %29, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:48.5813739Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5813997Z %72 = arith.muli %71, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5814211Z %73 = tt.broadcast %72 : tensor<4x1xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5814405Z %74 = arith.addi %73, %34 : tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5814609Z %75 = tt.addptr %27, %74 : tensor<4x256x!tt.ptr, #blocked>, tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5814829Z %76 = arith.cmpi sge, %71, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5814997Z %77 = arith.cmpi slt, %71, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5815166Z %78 = arith.andi %76, %77 : tensor<4x1xi1, #blocked> 2026-02-21T09:51:48.5815340Z %79 = tt.broadcast %78 : tensor<4x1xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T09:51:48.5815517Z %80 = arith.andi %79, %38 : tensor<4x256xi1, #blocked> 2026-02-21T09:51:48.5815719Z %81 = tt.load %75, %80, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x256x!tt.ptr, #blocked> 2026-02-21T09:51:48.5816047Z %82 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:51:48.5816403Z ttg.local_store %69, %82 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:51:48.5817026Z %83:6 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg4 = %cst_10, %arg5 = 
%c1_i32, %arg6 = %63, %arg7 = %82, %arg8 = %62, %arg9 = %81) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x256xi8, #blocked>, tensor<4x256xi8, #blocked>) : i32 { 2026-02-21T09:51:48.5817541Z %129 = arith.addi %arg3, %c8_i32 : i32 2026-02-21T09:51:48.5817662Z %130 = arith.muli %129, %c2_i32 : i32 2026-02-21T09:51:48.5817863Z %131 = tt.splat %130 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:48.5818084Z %132 = arith.addi %131, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:51:48.5818363Z %133 = tt.expand_dims %132 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:51:48.5818642Z %134 = tt.broadcast %133 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5818835Z %135 = arith.addi %24, %134 : tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5819039Z %136 = tt.addptr %25, %135 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:51:48.5819247Z %137 = tt.load %136 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:51:48.5819552Z %138 = ttg.local_load %arg6 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5819993Z %139 = arith.extf %138 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5820274Z %140 = arith.extsi %129 : i32 to i64 2026-02-21T09:51:48.5820443Z %141 = tt.splat %140 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:48.5820659Z %142 = arith.addi %141, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:51:48.5820945Z %143 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5821188Z %144 = arith.muli %143, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5821374Z %145 = tt.broadcast %144 : tensor<4x1xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5821564Z %146 = arith.addi %145, %34 : tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5821756Z %147 = tt.addptr %27, %146 : tensor<4x256x!tt.ptr, #blocked>, tensor<4x256xi64, #blocked> 2026-02-21T09:51:48.5821964Z %148 = arith.cmpi sge, %143, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5822150Z %149 = arith.cmpi slt, %143, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:51:48.5822311Z %150 = arith.andi %148, %149 : tensor<4x1xi1, #blocked> 2026-02-21T09:51:48.5822493Z %151 = tt.broadcast %150 : tensor<4x1xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T09:51:48.5822679Z %152 = arith.andi %151, %38 : tensor<4x256xi1, #blocked> 2026-02-21T09:51:48.5822844Z %153 = tt.load %147, %152, %cst_1 : tensor<4x256x!tt.ptr, #blocked> 2026-02-21T09:51:48.5823031Z %154 = arith.shli %arg8, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5823189Z %155 = arith.shrsi %154, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5823430Z %156 = ttg.convert_layout %155 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:48.5823677Z %157 = arith.shrsi %arg8, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5823918Z %158 = ttg.convert_layout %157 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:48.5824255Z %159 = tt.expand_dims %156 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1> 2026-02-21T09:51:48.5824596Z %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1> 2026-02-21T09:51:48.5824885Z %161 = tt.broadcast 
%159 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5825130Z %162 = arith.select %43, %161, %cst_0 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5825374Z %163 = tt.broadcast %160 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5825626Z %164 = arith.select %45, %163, %162 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5825863Z %165 = tt.reshape %164 : tensor<4x2x256xi8, #blocked1> -> tensor<8x256xi8, #blocked3> 2026-02-21T09:51:48.5826092Z %166 = arith.sitofp %165 : tensor<8x256xi8, #blocked3> to tensor<8x256xf32, #blocked3> 2026-02-21T09:51:48.5826344Z %167 = ttg.local_alloc %166 : (tensor<8x256xf32, #blocked3>) -> !ttg.memdesc<8x256xf32, #shared1, #smem> 2026-02-21T09:51:48.5826676Z %168 = ttg.local_load %167 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5827159Z %169 = tt.dot %139, %168, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:51:48.5827509Z %170 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:51:48.5827637Z %171 = arith.cmpi slt, %170, %c2_i32 : i32 2026-02-21T09:51:48.5827764Z %172 = arith.select %171, %170, %c0_i32 : i32 2026-02-21T09:51:48.5828031Z %173 = ttg.memdesc_index %46[%172] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:51:48.5828388Z ttg.local_store %137, %173 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:51:48.5828889Z scf.yield %169, %172, %arg7, %173, %arg9, %153 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, 
mutable, 2x128x8>, tensor<4x256xi8, #blocked>, tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5829294Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32} 2026-02-21T09:51:48.5829586Z %84 = ttg.local_load %83#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5830013Z %85 = arith.extf %84 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5830337Z %86 = arith.shli %83#4, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5830490Z %87 = arith.shrsi %86, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5830728Z %88 = ttg.convert_layout %87 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:48.5830967Z %89 = arith.shrsi %83#4, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5831224Z %90 = ttg.convert_layout %89 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:48.5831556Z %91 = tt.expand_dims %88 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1> 2026-02-21T09:51:48.5831892Z %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1> 2026-02-21T09:51:48.5832174Z %93 = tt.broadcast %91 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5832410Z %94 = arith.select %43, %93, %cst_0 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5832648Z %95 = tt.broadcast %92 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5832879Z %96 = arith.select %45, %95, %94 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5833103Z %97 = tt.reshape %96 : 
tensor<4x2x256xi8, #blocked1> -> tensor<8x256xi8, #blocked3> 2026-02-21T09:51:48.5833322Z %98 = arith.sitofp %97 : tensor<8x256xi8, #blocked3> to tensor<8x256xf32, #blocked3> 2026-02-21T09:51:48.5833565Z %99 = ttg.local_alloc %98 : (tensor<8x256xf32, #blocked3>) -> !ttg.memdesc<8x256xf32, #shared1, #smem> 2026-02-21T09:51:48.5833902Z %100 = ttg.local_load %99 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5834369Z %101 = tt.dot %85, %100, %83#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:51:48.5834854Z %102 = ttg.local_load %83#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5835282Z %103 = arith.extf %102 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5835581Z %104 = arith.shli %83#5, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5835737Z %105 = arith.shrsi %104, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5835983Z %106 = ttg.convert_layout %105 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:48.5836228Z %107 = arith.shrsi %83#5, %cst : tensor<4x256xi8, #blocked> 2026-02-21T09:51:48.5836469Z %108 = ttg.convert_layout %107 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:51:48.5836803Z %109 = tt.expand_dims %106 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1> 2026-02-21T09:51:48.5837159Z %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked1}>> -> tensor<4x1x256xi8, #blocked1> 2026-02-21T09:51:48.5837450Z %111 = tt.broadcast %109 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5837691Z %112 = arith.select %43, %111, %cst_0 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5837933Z %113 = tt.broadcast %110 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5838173Z %114 = arith.select %45, %113, %112 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1> 2026-02-21T09:51:48.5838423Z %115 = tt.reshape %114 : tensor<4x2x256xi8, #blocked1> -> tensor<8x256xi8, #blocked3> 2026-02-21T09:51:48.5838648Z %116 = arith.sitofp %115 : tensor<8x256xi8, #blocked3> to tensor<8x256xf32, #blocked3> 2026-02-21T09:51:48.5838898Z %117 = ttg.local_alloc %116 : (tensor<8x256xf32, #blocked3>) -> !ttg.memdesc<8x256xf32, #shared1, #smem> 2026-02-21T09:51:48.5839238Z %118 = ttg.local_load %117 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:51:48.5839704Z %119 = tt.dot %103, %118, %101, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:51:48.5840084Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:51:48.5840301Z %120 = arith.truncf %119 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:51:48.5840575Z %121 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:51:48.5840810Z %122 = arith.muli %121, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:51:48.5841046Z %123 = tt.expand_dims %20 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:51:48.5841303Z %124 = 
tt.broadcast %122 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:51:48.5841508Z %125 = tt.broadcast %123 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:51:48.5841688Z %126 = arith.addi %124, %125 : tensor<128x256xi32, #mma> 2026-02-21T09:51:48.5841883Z %127 = tt.splat %arg2 : !tt.ptr -> tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:51:48.5842102Z %128 = tt.addptr %127, %126 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:51:48.5842303Z tt.store %128, %120 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:51:48.5842434Z tt.return 2026-02-21T09:51:48.5842513Z } 2026-02-21T09:51:48.5842634Z } 2026-02-21T09:51:48.5842676Z 2026-02-21T09:51:48.5842706Z {-# 2026-02-21T09:51:48.5842788Z external_resources: { 2026-02-21T09:51:48.5842889Z mlir_reproducer: { 2026-02-21T09:51:48.5843877Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:51:48.5844858Z disable_threading: false, 2026-02-21T09:51:48.5844966Z verify_each: true 2026-02-21T09:51:48.5845054Z } 2026-02-21T09:51:48.5845129Z } 2026-02-21T09:51:48.5845198Z #-} 2026-02-21T09:51:48.5845480Z /tmp/torchinductor_root/os/cosunfwyrn6frbsp3rjael2doabicbr5htzyug5z4hpwzysdspuh.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:51:48.5846178Z /tmp/torchinductor_root/os/cosunfwyrn6frbsp3rjael2doabicbr5htzyug5z4hpwzysdspuh.py:13:0: note: Pipeline failed while 
executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:51:48.5846727Z [439s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:51:48.5847455Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:51:48.5848126Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:51:48.5848292Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:51:48.7037180Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 80/80 8.2 configs/s 2026-02-21T09:51:59.0282559Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 193/193 13.7 configs/s 2026-02-21T09:52:01.9061790Z [452s] Generation 5 complete: 2026-02-21T09:52:01.9062216Z error=5 2026-02-21T09:52:01.9062424Z ok=79 2026-02-21T09:52:01.9062629Z min=1.0552 2026-02-21T09:52:01.9062834Z mid=1.4344 2026-02-21T09:52:01.9063083Z max=66.5692 2026-02-21T09:52:01.9063312Z best={'block_sizes': [8, 128, 128], 2026-02-21T09:52:01.9063712Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:52:01.9064072Z 'l2_groupings': [2], 2026-02-21T09:52:01.9064344Z 'load_eviction_policies': ['', ''], 2026-02-21T09:52:01.9064649Z 'loop_orders': [[0, 1]], 2026-02-21T09:52:01.9064927Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:52:01.9065201Z 'num_stages': 1, 2026-02-21T09:52:01.9065431Z 'num_warps': 4, 2026-02-21T09:52:01.9065665Z 'pid_type': 'flat', 2026-02-21T09:52:01.9065919Z 'range_flattens': [None, None], 
2026-02-21T09:52:01.9066224Z 'range_multi_buffers': [None, False], 2026-02-21T09:52:01.9066536Z 'range_num_stages': [0, 2], 2026-02-21T09:52:01.9066815Z 'range_unroll_factors': [0, 0], 2026-02-21T09:52:01.9067120Z 'range_warp_specializes': [], 2026-02-21T09:52:01.9067396Z 'waves_per_eu': 2} 2026-02-21T09:52:01.9199355Z [452s] Fitting surrogate: 590 points, 590 targets 2026-02-21T09:52:02.7329414Z [453s] Generation 6 starting: 81 neighbors, 4 active search path(s) 2026-02-21T09:52:20.9076300Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.6 configs/s 2026-02-21T09:52:25.4079851Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:52:25.4094939Z #blocked = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:52:25.4095486Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:52:25.4095955Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:52:25.4096413Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:52:25.4096850Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:52:25.4097250Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:52:25.4097534Z #smem = #ttg.shared_memory 2026-02-21T09:52:25.4097873Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:52:25.4099049Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr 
{tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:52:25.4099645Z %cst = arith.constant dense<4> : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4099892Z %c9728_i32 = arith.constant 9728 : i32 2026-02-21T09:52:25.4100080Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:52:25.4100308Z %cst_0 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4100569Z %c29184_i32 = arith.constant 29184 : i32 2026-02-21T09:52:25.4100750Z %c19456_i32 = arith.constant 19456 : i32 2026-02-21T09:52:25.4101022Z %cst_1 = arith.constant dense<0> : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4101195Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T09:52:25.4101335Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:52:25.4101478Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:52:25.4101615Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:52:25.4101758Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:52:25.4101896Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:52:25.4102097Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:25.4102269Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4102481Z %cst_3 = arith.constant dense<8192> : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4102738Z %cst_4 = arith.constant dense<4> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4103031Z %cst_5 = arith.constant dense<8> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4103325Z %cst_6 = arith.constant dense<2> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4103605Z %cst_7 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4103824Z %c6_i32 = arith.constant 6 : i32 2026-02-21T09:52:25.4103961Z %c506_i32 = arith.constant 506 : i32 2026-02-21T09:52:25.4104099Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:52:25.4104234Z 
%c1_i32 = arith.constant 1 : i32 2026-02-21T09:52:25.4104370Z %c13823_i32 = arith.constant 13823 : i32 2026-02-21T09:52:25.4104563Z %cst_8 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4104811Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:25.4105120Z %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:25.4105327Z %cst_11 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4105537Z %cst_12 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4105739Z %cst_13 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4105947Z %cst_14 = arith.constant dense<0> : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4106146Z %cst_15 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4106351Z %cst_16 = arith.constant dense<8192> : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4106553Z %cst_17 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4106753Z %cst_18 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4106953Z %cst_19 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4107121Z %0 = tt.get_program_id x : i32 2026-02-21T09:52:25.4107359Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4107694Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4108010Z %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4108302Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4108556Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4108825Z %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4109195Z %7 = arith.extsi %6 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4109572Z %8 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4109946Z %9 = arith.extsi %8 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4110391Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:52:25.4110829Z %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:52:25.4111253Z %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:25.4111507Z %13 = arith.cmpi eq, %12, %cst_9 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:25.4111712Z %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 2026-02-21T09:52:25.4111920Z %15 = arith.cmpi eq, %12, %cst_10 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:25.4112115Z %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 2026-02-21T09:52:25.4112333Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.4112604Z %18 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4112910Z %19 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4113217Z %20 = arith.extsi %19 : tensor<256xi32, #ttg.slice<{dim = 0, parent = 
#mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4113449Z %21 = arith.subi %c13823_i32, %0 : i32 2026-02-21T09:52:25.4113575Z %22 = arith.divui %21, %c9728_i32 : i32 2026-02-21T09:52:25.4113695Z %23 = arith.remsi %22, %c4_i32 : i32 2026-02-21T09:52:25.4113840Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:52:25.4113954Z %25 = arith.muli %24, %c9728_i32 : i32 2026-02-21T09:52:25.4114076Z %26 = arith.addi %0, %25 : i32 2026-02-21T09:52:25.4114205Z scf.for %arg3 = %0 to %26 step %c38912_i32 : i32 { 2026-02-21T09:52:25.4114349Z %27 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.4114476Z %28 = arith.muli %27, %c8_i32 : i32 2026-02-21T09:52:25.4114595Z %29 = arith.subi %c128_i32, %28 : i32 2026-02-21T09:52:25.4114718Z %30 = arith.minsi %29, %c8_i32 : i32 2026-02-21T09:52:25.4114839Z %31 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.4114965Z %32 = arith.remsi %31, %30 : i32 2026-02-21T09:52:25.4115079Z %33 = arith.addi %28, %32 : i32 2026-02-21T09:52:25.4115196Z %34 = arith.divsi %31, %30 : i32 2026-02-21T09:52:25.4115311Z %35 = arith.muli %33, %c128_i32 : i32 2026-02-21T09:52:25.4115484Z %36 = tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4115714Z %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4115888Z %38 = arith.muli %34, %c256_i32 : i32 2026-02-21T09:52:25.4116115Z %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4116367Z %40 = arith.muli %39, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4116584Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4116761Z %42 = arith.extsi %38 : i32 to i64 2026-02-21T09:52:25.4116927Z %43 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 
2026-02-21T09:52:25.4117145Z %44 = arith.addi %43, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4117415Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4117690Z %46 = tt.broadcast %45 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4117911Z %47 = arith.cmpi sge, %45, %cst_14 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4118081Z %48 = arith.cmpi slt, %45, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4118246Z %49 = arith.andi %47, %48 : tensor<1x256xi1, #blocked> 2026-02-21T09:52:25.4118428Z %50 = tt.broadcast %49 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4118649Z %51 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4118932Z %52 = tt.expand_dims %3 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4119207Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4119400Z %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4119596Z %55 = tt.addptr %4, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4119805Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4120042Z %57 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4120287Z %58 = arith.muli %57, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4120475Z %59 = tt.broadcast %58 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4120660Z %60 = arith.addi %59, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4120858Z %61 = tt.addptr %5, %60 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 
2026-02-21T09:52:25.4121061Z %62 = arith.cmpi sge, %57, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4121238Z %63 = arith.cmpi slt, %57, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4121422Z %64 = arith.andi %62, %63 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4121598Z %65 = tt.broadcast %64 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4121786Z %66 = arith.andi %65, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4121993Z %67 = tt.load %61, %66, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4122336Z %68 = ttg.memdesc_index %51[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4122772Z ttg.local_store %56, %68 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4123045Z %69 = arith.addi %3, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4123329Z %70 = tt.expand_dims %69 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4123600Z %71 = tt.broadcast %70 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4123796Z %72 = arith.addi %41, %71 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4123994Z %73 = tt.addptr %4, %72 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4124196Z %74 = tt.load %73 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4124405Z %75 = arith.addi %7, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4124675Z %76 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4124920Z %77 = arith.muli %76, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4125104Z %78 = tt.broadcast 
%77 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4125296Z %79 = arith.addi %78, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4125504Z %80 = tt.addptr %5, %79 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4125724Z %81 = arith.cmpi sge, %76, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4125896Z %82 = arith.cmpi slt, %76, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4126052Z %83 = arith.andi %81, %82 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4126228Z %84 = tt.broadcast %83 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4126409Z %85 = arith.andi %84, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4126643Z %86 = tt.load %80, %85, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4126978Z %87 = ttg.memdesc_index %51[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4127335Z ttg.local_store %74, %87 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4127608Z %88 = arith.addi %3, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4127878Z %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4128148Z %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4128336Z %91 = arith.addi %41, %90 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4128529Z %92 = tt.addptr %4, %91 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4128731Z %93 = tt.load %92 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4128917Z %94 = arith.addi %7, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4129207Z 
%95 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4129444Z %96 = arith.muli %95, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4129627Z %97 = tt.broadcast %96 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4129813Z %98 = arith.addi %97, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4129997Z %99 = tt.addptr %5, %98 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4130200Z %100 = arith.cmpi sge, %95, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4130368Z %101 = arith.cmpi slt, %95, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4130529Z %102 = arith.andi %100, %101 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4130711Z %103 = tt.broadcast %102 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4130895Z %104 = arith.andi %103, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4131108Z %105 = tt.load %99, %104, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4131449Z %106 = ttg.memdesc_index %51[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4131812Z ttg.local_store %93, %106 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4132604Z %107:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %68, %arg8 = %87, %arg9 = %106, %arg10 = %67, %arg11 = %86, %arg12 = %105) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 
2026-02-21T09:52:25.4133271Z %553 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:52:25.4133395Z %554 = arith.muli %553, %c2_i32 : i32 2026-02-21T09:52:25.4133586Z %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4133807Z %556 = arith.addi %555, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4134084Z %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4134359Z %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4134570Z %559 = arith.addi %41, %558 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4134773Z %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4134978Z %561 = tt.load %560 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4135282Z %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4135724Z %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4136009Z %564 = arith.extsi %553 : i32 to i64 2026-02-21T09:52:25.4136182Z %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4136404Z %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4136681Z %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4136926Z %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4137135Z %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 
2026-02-21T09:52:25.4137331Z %570 = arith.addi %569, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4137527Z %571 = tt.addptr %5, %570 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4137736Z %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4137905Z %573 = arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4138070Z %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4138256Z %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4138443Z %576 = arith.andi %575, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4138610Z %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4138782Z %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4138947Z %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4139192Z %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4139444Z %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4139690Z %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4140040Z %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4140389Z %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4140681Z %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4140927Z %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4141172Z %587 = tt.broadcast %584 : 
tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4141430Z %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4141666Z %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4141893Z %590 = arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4142206Z %591 = ttg.convert_layout %590 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4142682Z %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4143041Z %593 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.4143169Z %594 = arith.cmpi slt, %593, %c3_i32 : i32 2026-02-21T09:52:25.4143307Z %595 = arith.select %594, %593, %c0_i32 : i32 2026-02-21T09:52:25.4143580Z %596 = ttg.memdesc_index %51[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4143947Z ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4144568Z scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4145109Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T09:52:25.4145392Z %108 = ttg.local_load %107#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4145826Z %109 = arith.extf %108 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4146129Z %110 = arith.shli %107#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4146291Z %111 = arith.shrsi %110, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4146535Z %112 = ttg.convert_layout %111 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4146785Z %113 = arith.shrsi %107#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4147029Z %114 = ttg.convert_layout %113 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4147365Z %115 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4147713Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4148004Z %117 = tt.broadcast %115 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4148264Z %118 = arith.select %14, %117, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4148510Z %119 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4148747Z %120 = arith.select %16, %119, %118 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4148982Z %121 = tt.reshape %120 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4149207Z %122 = arith.sitofp %121 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4149519Z %123 = ttg.convert_layout %122 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4149991Z %124 = tt.dot %109, %123, %107#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4150503Z %125 = ttg.local_load %107#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4150934Z %126 = arith.extf %125 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4151235Z %127 = arith.shli %107#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4151394Z %128 = arith.shrsi %127, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4151643Z %129 = ttg.convert_layout %128 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4151888Z %130 = arith.shrsi %107#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4152131Z %131 = ttg.convert_layout %130 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4152469Z %132 = tt.expand_dims %129 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4152811Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4153166Z %134 = tt.broadcast %132 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4153427Z %135 = arith.select %14, %134, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4153674Z %136 = tt.broadcast %133 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 
2026-02-21T09:52:25.4153915Z %137 = arith.select %16, %136, %135 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4154148Z %138 = tt.reshape %137 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4154376Z %139 = arith.sitofp %138 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4154670Z %140 = ttg.convert_layout %139 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4155138Z %141 = tt.dot %126, %140, %124, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4155631Z %142 = ttg.local_load %107#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4156059Z %143 = arith.extf %142 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4156371Z %144 = arith.shli %107#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4156533Z %145 = arith.shrsi %144, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4156775Z %146 = ttg.convert_layout %145 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4157023Z %147 = arith.shrsi %107#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4157266Z %148 = ttg.convert_layout %147 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4157603Z %149 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4157968Z %150 = tt.expand_dims %148 {axis = 1 : i32} : 
tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4158255Z %151 = tt.broadcast %149 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4158502Z %152 = arith.select %14, %151, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4158763Z %153 = tt.broadcast %150 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4159004Z %154 = arith.select %16, %153, %152 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4159241Z %155 = tt.reshape %154 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4159464Z %156 = arith.sitofp %155 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4159761Z %157 = ttg.convert_layout %156 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4160228Z %158 = tt.dot %143, %157, %141, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4160615Z ttg.local_dealloc %51 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4160833Z %159 = arith.truncf %158 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.4161005Z %160 = arith.extsi %35 : i32 to i64 2026-02-21T09:52:25.4161171Z %161 = tt.splat %160 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4161404Z %162 = arith.addi %161, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4161671Z %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4161915Z %164 = arith.muli %163, %cst_17 : tensor<128x1xi64, #mma> 
2026-02-21T09:52:25.4162096Z %165 = tt.broadcast %164 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4162305Z %166 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4162516Z %167 = arith.addi %166, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4162831Z %168 = tt.expand_dims %167 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4163094Z %169 = tt.broadcast %168 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4163277Z %170 = arith.addi %165, %169 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4163476Z %171 = tt.addptr %17, %170 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4163678Z %172 = arith.cmpi sge, %163, %cst_18 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4163847Z %173 = arith.cmpi slt, %163, %cst_19 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4164009Z %174 = arith.andi %172, %173 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.4164204Z %175 = tt.broadcast %174 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4164392Z %176 = arith.cmpi sge, %168, %cst_15 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4164555Z %177 = arith.cmpi slt, %168, %cst_16 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4164712Z %178 = arith.andi %176, %177 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.4164887Z %179 = tt.broadcast %178 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4165065Z %180 = arith.andi %175, %179 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4165230Z tt.store %171, %159, %180 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.4165399Z %181 = arith.addi %arg3, %c9728_i32 : i32 2026-02-21T09:52:25.4165526Z %182 = arith.divsi %181, %c256_i32 : i32 2026-02-21T09:52:25.4165645Z %183 = arith.muli %182, %c8_i32 : i32 2026-02-21T09:52:25.4165766Z %184 = arith.subi %c128_i32, %183 : i32 
2026-02-21T09:52:25.4165882Z %185 = arith.minsi %184, %c8_i32 : i32 2026-02-21T09:52:25.4166004Z %186 = arith.remsi %181, %c256_i32 : i32 2026-02-21T09:52:25.4166123Z %187 = arith.remsi %186, %185 : i32 2026-02-21T09:52:25.4166235Z %188 = arith.addi %183, %187 : i32 2026-02-21T09:52:25.4166367Z %189 = arith.divsi %186, %185 : i32 2026-02-21T09:52:25.4166481Z %190 = arith.muli %188, %c128_i32 : i32 2026-02-21T09:52:25.4166653Z %191 = tt.splat %190 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4166878Z %192 = arith.addi %191, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4167056Z %193 = arith.muli %189, %c256_i32 : i32 2026-02-21T09:52:25.4167290Z %194 = tt.expand_dims %192 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4167545Z %195 = arith.muli %194, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4167746Z %196 = tt.broadcast %195 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4167921Z %197 = arith.extsi %193 : i32 to i64 2026-02-21T09:52:25.4168093Z %198 = tt.splat %197 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4168314Z %199 = arith.addi %198, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4168587Z %200 = tt.expand_dims %199 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4168888Z %201 = tt.broadcast %200 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4169092Z %202 = arith.cmpi sge, %200, %cst_14 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4169269Z %203 = arith.cmpi slt, %200, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4169437Z %204 = arith.andi %202, %203 : tensor<1x256xi1, #blocked> 2026-02-21T09:52:25.4169622Z %205 = tt.broadcast %204 : 
tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4169840Z %206 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4170028Z %207 = arith.addi %196, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4170229Z %208 = tt.addptr %4, %207 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4170434Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4170596Z %210 = arith.addi %59, %201 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4170790Z %211 = tt.addptr %5, %210 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4170990Z %212 = arith.andi %65, %205 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4171205Z %213 = tt.load %211, %212, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4171546Z %214 = ttg.memdesc_index %206[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4171933Z ttg.local_store %209, %214 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4172177Z %215 = arith.addi %196, %71 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4172375Z %216 = tt.addptr %4, %215 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4172588Z %217 = tt.load %216 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4172744Z %218 = arith.addi %78, %201 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4172951Z %219 = tt.addptr %5, %218 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4173150Z %220 = arith.andi %84, %205 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4173362Z %221 = tt.load %219, %220, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4173703Z %222 = ttg.memdesc_index %206[%c1_i32] 
: !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4174080Z ttg.local_store %217, %222 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4174323Z %223 = arith.addi %196, %90 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4174525Z %224 = tt.addptr %4, %223 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4174727Z %225 = tt.load %224 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4174890Z %226 = arith.addi %97, %201 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4175078Z %227 = tt.addptr %5, %226 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4175274Z %228 = arith.andi %103, %205 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4175482Z %229 = tt.load %227, %228, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4175820Z %230 = ttg.memdesc_index %206[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4176180Z ttg.local_store %225, %230 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4176970Z %231:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %214, %arg8 = %222, %arg9 = %230, %arg10 = %213, %arg11 = %221, %arg12 = %229) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:52:25.4177642Z %553 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:52:25.4186049Z %554 = arith.muli %553, %c2_i32 : i32 
2026-02-21T09:52:25.4186250Z %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4186482Z %556 = arith.addi %555, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4186762Z %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4187046Z %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4187251Z %559 = arith.addi %196, %558 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4187456Z %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4187666Z %561 = tt.load %560 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4188013Z %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4188457Z %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4188745Z %564 = arith.extsi %553 : i32 to i64 2026-02-21T09:52:25.4188918Z %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4189144Z %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4189440Z %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4189688Z %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4189885Z %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4190078Z %570 = arith.addi %569, %201 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4190294Z %571 = tt.addptr %5, %570 : 
tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4190500Z %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4190675Z %573 = arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4190843Z %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4191031Z %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4191229Z %576 = arith.andi %575, %205 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4191397Z %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4191574Z %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4191738Z %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4191985Z %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4192237Z %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4192481Z %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4192842Z %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4193193Z %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4193487Z %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4193739Z %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4193983Z %587 = tt.broadcast %584 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4194230Z %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, 
#blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4194467Z %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4194700Z %590 = arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4195001Z %591 = ttg.convert_layout %590 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4195477Z %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4195845Z %593 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.4195977Z %594 = arith.cmpi slt, %593, %c3_i32 : i32 2026-02-21T09:52:25.4196111Z %595 = arith.select %594, %593, %c0_i32 : i32 2026-02-21T09:52:25.4196386Z %596 = ttg.memdesc_index %206[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4196751Z ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4197377Z scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4197919Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T09:52:25.4198212Z %232 = ttg.local_load %231#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4198650Z %233 = arith.extf %232 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent 
= #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4198955Z %234 = arith.shli %231#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4199118Z %235 = arith.shrsi %234, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4199365Z %236 = ttg.convert_layout %235 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4199614Z %237 = arith.shrsi %231#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4199858Z %238 = ttg.convert_layout %237 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4200197Z %239 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4200542Z %240 = tt.expand_dims %238 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4200834Z %241 = tt.broadcast %239 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4201094Z %242 = arith.select %14, %241, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4201340Z %243 = tt.broadcast %240 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4201583Z %244 = arith.select %16, %243, %242 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4201817Z %245 = tt.reshape %244 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4202047Z %246 = arith.sitofp %245 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4202348Z %247 = ttg.convert_layout %246 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4202885Z %248 = tt.dot %233, %247, %231#0, inputPrecision = tf32 : tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4203390Z %249 = ttg.local_load %231#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4203822Z %250 = arith.extf %249 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4204143Z %251 = arith.shli %231#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4204305Z %252 = arith.shrsi %251, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4204552Z %253 = ttg.convert_layout %252 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4204802Z %254 = arith.shrsi %231#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4205045Z %255 = ttg.convert_layout %254 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4205383Z %256 = tt.expand_dims %253 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4205746Z %257 = tt.expand_dims %255 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4206034Z %258 = tt.broadcast %256 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4206280Z %259 = arith.select %14, %258, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4206538Z %260 = tt.broadcast %257 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4206780Z %261 = arith.select %16, %260, %259 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4207019Z %262 = tt.reshape 
%261 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4207246Z %263 = arith.sitofp %262 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4207547Z %264 = ttg.convert_layout %263 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4208012Z %265 = tt.dot %250, %264, %248, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4208511Z %266 = ttg.local_load %231#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4208949Z %267 = arith.extf %266 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4209246Z %268 = arith.shli %231#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4209426Z %269 = arith.shrsi %268, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4209671Z %270 = ttg.convert_layout %269 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4209918Z %271 = arith.shrsi %231#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4210162Z %272 = ttg.convert_layout %271 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4210499Z %273 = tt.expand_dims %270 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4210844Z %274 = tt.expand_dims %272 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4211135Z %275 = tt.broadcast %273 : tensor<2x1x256xi8, #blocked1> -> 
tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4211379Z %276 = arith.select %14, %275, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4211628Z %277 = tt.broadcast %274 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4211867Z %278 = arith.select %16, %277, %276 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4212105Z %279 = tt.reshape %278 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4212346Z %280 = arith.sitofp %279 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4212644Z %281 = ttg.convert_layout %280 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4213107Z %282 = tt.dot %267, %281, %265, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4213493Z ttg.local_dealloc %206 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4213726Z %283 = arith.truncf %282 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.4213905Z %284 = arith.extsi %190 : i32 to i64 2026-02-21T09:52:25.4214072Z %285 = tt.splat %284 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4214288Z %286 = arith.addi %285, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4214559Z %287 = tt.expand_dims %286 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4214821Z %288 = arith.muli %287, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4215006Z %289 = tt.broadcast %288 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4215214Z %290 = tt.splat %197 : i64 -> tensor<256xi64, 
#ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4215426Z %291 = arith.addi %290, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4215691Z %292 = tt.expand_dims %291 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4215955Z %293 = tt.broadcast %292 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4216142Z %294 = arith.addi %289, %293 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4216335Z %295 = tt.addptr %17, %294 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4216541Z %296 = arith.cmpi sge, %287, %cst_18 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4216711Z %297 = arith.cmpi slt, %287, %cst_19 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4216873Z %298 = arith.andi %296, %297 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.4217051Z %299 = tt.broadcast %298 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4217252Z %300 = arith.cmpi sge, %292, %cst_15 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4217421Z %301 = arith.cmpi slt, %292, %cst_16 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4217580Z %302 = arith.andi %300, %301 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.4217756Z %303 = tt.broadcast %302 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4217937Z %304 = arith.andi %299, %303 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4218101Z tt.store %295, %283, %304 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.4218256Z %305 = arith.addi %arg3, %c19456_i32 : i32 2026-02-21T09:52:25.4218383Z %306 = arith.divsi %305, %c256_i32 : i32 2026-02-21T09:52:25.4218508Z %307 = arith.muli %306, %c8_i32 : i32 2026-02-21T09:52:25.4218630Z %308 = arith.subi %c128_i32, %307 : i32 2026-02-21T09:52:25.4218751Z %309 = arith.minsi %308, %c8_i32 : i32 2026-02-21T09:52:25.4218871Z %310 = arith.remsi %305, %c256_i32 : i32 2026-02-21T09:52:25.4218992Z %311 = 
arith.remsi %310, %309 : i32 2026-02-21T09:52:25.4219112Z %312 = arith.addi %307, %311 : i32 2026-02-21T09:52:25.4219227Z %313 = arith.divsi %310, %309 : i32 2026-02-21T09:52:25.4219346Z %314 = arith.muli %312, %c128_i32 : i32 2026-02-21T09:52:25.4219517Z %315 = tt.splat %314 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4219765Z %316 = arith.addi %315, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4219937Z %317 = arith.muli %313, %c256_i32 : i32 2026-02-21T09:52:25.4220168Z %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4220424Z %319 = arith.muli %318, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4220622Z %320 = tt.broadcast %319 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4220802Z %321 = arith.extsi %317 : i32 to i64 2026-02-21T09:52:25.4220969Z %322 = tt.splat %321 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4221207Z %323 = arith.addi %322, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4221486Z %324 = tt.expand_dims %323 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4221765Z %325 = tt.broadcast %324 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4221971Z %326 = arith.cmpi sge, %324, %cst_14 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4222161Z %327 = arith.cmpi slt, %324, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4222328Z %328 = arith.andi %326, %327 : tensor<1x256xi1, #blocked> 2026-02-21T09:52:25.4222514Z %329 = tt.broadcast %328 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4222733Z %330 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 
2026-02-21T09:52:25.4222926Z %331 = arith.addi %320, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4223127Z %332 = tt.addptr %4, %331 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4223336Z %333 = tt.load %332 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4223494Z %334 = arith.addi %59, %325 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4223688Z %335 = tt.addptr %5, %334 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4223887Z %336 = arith.andi %65, %329 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4224100Z %337 = tt.load %335, %336, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4224464Z %338 = ttg.memdesc_index %330[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4224828Z ttg.local_store %333, %338 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4225075Z %339 = arith.addi %320, %71 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4225275Z %340 = tt.addptr %4, %339 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4225479Z %341 = tt.load %340 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4225638Z %342 = arith.addi %78, %325 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4225829Z %343 = tt.addptr %5, %342 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4226028Z %344 = arith.andi %84, %329 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4226241Z %345 = tt.load %343, %344, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4226579Z %346 = ttg.memdesc_index %330[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4226941Z ttg.local_store %341, 
%346 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4227180Z %347 = arith.addi %320, %90 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4227383Z %348 = tt.addptr %4, %347 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4227605Z %349 = tt.load %348 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4227764Z %350 = arith.addi %97, %325 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4227958Z %351 = tt.addptr %5, %350 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4228154Z %352 = arith.andi %103, %329 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4228367Z %353 = tt.load %351, %352, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4228702Z %354 = ttg.memdesc_index %330[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4229080Z ttg.local_store %349, %354 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4229878Z %355:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %338, %arg8 = %346, %arg9 = %354, %arg10 = %337, %arg11 = %345, %arg12 = %353) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:52:25.4230550Z %553 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:52:25.4230676Z %554 = arith.muli %553, %c2_i32 : i32 2026-02-21T09:52:25.4230848Z %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4231077Z %556 = arith.addi %555, %3 : tensor<4xi32, 
#ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4231359Z %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4231638Z %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4231835Z %559 = arith.addi %320, %558 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4232037Z %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4232246Z %561 = tt.load %560 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4232562Z %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4233000Z %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4233284Z %564 = arith.extsi %553 : i32 to i64 2026-02-21T09:52:25.4233453Z %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4233674Z %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4233950Z %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4234195Z %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4234391Z %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4234584Z %570 = arith.addi %569, %325 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4234784Z %571 = tt.addptr %5, %570 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4234994Z %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4235166Z %573 = 
arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4235345Z %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4235530Z %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4235721Z %576 = arith.andi %575, %329 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4235888Z %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4236067Z %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4236231Z %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4236476Z %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4236767Z %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4237012Z %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4237351Z %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4237728Z %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4238020Z %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4238271Z %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4238517Z %587 = tt.broadcast %584 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4238761Z %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4238999Z %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4239226Z %590 = 
arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4239528Z %591 = ttg.convert_layout %590 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4240004Z %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4240372Z %593 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.4240503Z %594 = arith.cmpi slt, %593, %c3_i32 : i32 2026-02-21T09:52:25.4240637Z %595 = arith.select %594, %593, %c0_i32 : i32 2026-02-21T09:52:25.4240913Z %596 = ttg.memdesc_index %330[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4241277Z ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4241900Z scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4242421Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T09:52:25.4242748Z %356 = ttg.local_load %355#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4243185Z %357 = arith.extf %356 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4243489Z %358 = arith.shli %355#5, %cst : tensor<2x256xi8, #blocked> 
2026-02-21T09:52:25.4243669Z %359 = arith.shrsi %358, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4243916Z %360 = ttg.convert_layout %359 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4244163Z %361 = arith.shrsi %355#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4244406Z %362 = ttg.convert_layout %361 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4244746Z %363 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4245115Z %364 = tt.expand_dims %362 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4245408Z %365 = tt.broadcast %363 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4245656Z %366 = arith.select %14, %365, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4245900Z %367 = tt.broadcast %364 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4246158Z %368 = arith.select %16, %367, %366 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4246394Z %369 = tt.reshape %368 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4246619Z %370 = arith.sitofp %369 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4246917Z %371 = ttg.convert_layout %370 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4247389Z %372 = tt.dot %357, %371, %355#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4247891Z 
%373 = ttg.local_load %355#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4248329Z %374 = arith.extf %373 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4248635Z %375 = arith.shli %355#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4248814Z %376 = arith.shrsi %375, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4249057Z %377 = ttg.convert_layout %376 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4249305Z %378 = arith.shrsi %355#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4249548Z %379 = ttg.convert_layout %378 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4249884Z %380 = tt.expand_dims %377 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4250232Z %381 = tt.expand_dims %379 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4250519Z %382 = tt.broadcast %380 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4250767Z %383 = arith.select %14, %382, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4251010Z %384 = tt.broadcast %381 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4251248Z %385 = arith.select %16, %384, %383 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4251483Z %386 = tt.reshape %385 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4251727Z %387 = arith.sitofp %386 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 
2026-02-21T09:52:25.4252026Z %388 = ttg.convert_layout %387 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4252495Z %389 = tt.dot %374, %388, %372, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4252991Z %390 = ttg.local_load %355#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4253439Z %391 = arith.extf %390 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4253738Z %392 = arith.shli %355#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4253899Z %393 = arith.shrsi %392, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4254142Z %394 = ttg.convert_layout %393 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4254405Z %395 = arith.shrsi %355#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4254647Z %396 = ttg.convert_layout %395 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4254983Z %397 = tt.expand_dims %394 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4255327Z %398 = tt.expand_dims %396 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4255616Z %399 = tt.broadcast %397 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4255864Z %400 = arith.select %14, %399, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 
2026-02-21T09:52:25.4256104Z %401 = tt.broadcast %398 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4256345Z %402 = arith.select %16, %401, %400 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4256578Z %403 = tt.reshape %402 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4256823Z %404 = arith.sitofp %403 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4257122Z %405 = ttg.convert_layout %404 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4257590Z %406 = tt.dot %391, %405, %389, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4257977Z ttg.local_dealloc %330 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4258195Z %407 = arith.truncf %406 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.4258371Z %408 = arith.extsi %314 : i32 to i64 2026-02-21T09:52:25.4258538Z %409 = tt.splat %408 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4258752Z %410 = arith.addi %409, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4259022Z %411 = tt.expand_dims %410 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4259265Z %412 = arith.muli %411, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4259449Z %413 = tt.broadcast %412 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4259660Z %414 = tt.splat %321 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4259884Z %415 = arith.addi %414, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T09:52:25.4260150Z %416 = tt.expand_dims %415 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4260413Z %417 = tt.broadcast %416 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4260596Z %418 = arith.addi %413, %417 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4260793Z %419 = tt.addptr %17, %418 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4260995Z %420 = arith.cmpi sge, %411, %cst_18 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4261180Z %421 = arith.cmpi slt, %411, %cst_19 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4261343Z %422 = arith.andi %420, %421 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.4261521Z %423 = tt.broadcast %422 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4261710Z %424 = arith.cmpi sge, %416, %cst_15 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4261876Z %425 = arith.cmpi slt, %416, %cst_16 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4262048Z %426 = arith.andi %424, %425 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.4262220Z %427 = tt.broadcast %426 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4262401Z %428 = arith.andi %423, %427 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4262562Z tt.store %419, %407, %428 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.4262716Z %429 = arith.addi %arg3, %c29184_i32 : i32 2026-02-21T09:52:25.4262844Z %430 = arith.divsi %429, %c256_i32 : i32 2026-02-21T09:52:25.4262966Z %431 = arith.muli %430, %c8_i32 : i32 2026-02-21T09:52:25.4263087Z %432 = arith.subi %c128_i32, %431 : i32 2026-02-21T09:52:25.4263205Z %433 = arith.minsi %432, %c8_i32 : i32 2026-02-21T09:52:25.4263324Z %434 = arith.remsi %429, %c256_i32 : i32 2026-02-21T09:52:25.4263441Z %435 = arith.remsi %434, %433 : i32 2026-02-21T09:52:25.4263556Z %436 = arith.addi %431, %435 : i32 2026-02-21T09:52:25.4263669Z %437 = arith.divsi %434, %433 : i32 
2026-02-21T09:52:25.4263788Z %438 = arith.muli %436, %c128_i32 : i32 2026-02-21T09:52:25.4263959Z %439 = tt.splat %438 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4264184Z %440 = arith.addi %439, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4264378Z %441 = arith.muli %437, %c256_i32 : i32 2026-02-21T09:52:25.4264604Z %442 = tt.expand_dims %440 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4264861Z %443 = arith.muli %442, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4265057Z %444 = tt.broadcast %443 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4265235Z %445 = arith.extsi %441 : i32 to i64 2026-02-21T09:52:25.4265405Z %446 = tt.splat %445 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4265623Z %447 = arith.addi %446, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4265900Z %448 = tt.expand_dims %447 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4266180Z %449 = tt.broadcast %448 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4266385Z %450 = arith.cmpi sge, %448, %cst_14 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4266564Z %451 = arith.cmpi slt, %448, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4266729Z %452 = arith.andi %450, %451 : tensor<1x256xi1, #blocked> 2026-02-21T09:52:25.4266917Z %453 = tt.broadcast %452 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4267148Z %454 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4267337Z %455 = arith.addi %444, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4267539Z %456 = tt.addptr %4, %455 : tensor<128x4x!tt.ptr, 
#blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4267745Z %457 = tt.load %456 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4267905Z %458 = arith.addi %59, %449 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4268097Z %459 = tt.addptr %5, %458 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4268293Z %460 = arith.andi %65, %453 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4268524Z %461 = tt.load %459, %460, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4268867Z %462 = ttg.memdesc_index %454[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4269235Z ttg.local_store %457, %462 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4269491Z %463 = arith.addi %444, %71 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4269692Z %464 = tt.addptr %4, %463 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4269895Z %465 = tt.load %464 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4270054Z %466 = arith.addi %78, %449 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4270248Z %467 = tt.addptr %5, %466 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4270446Z %468 = arith.andi %84, %453 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4270658Z %469 = tt.load %467, %468, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4270994Z %470 = ttg.memdesc_index %454[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4271356Z ttg.local_store %465, %470 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4271596Z %471 = arith.addi %444, %90 : 
tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4271793Z %472 = tt.addptr %4, %471 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4272012Z %473 = tt.load %472 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4272168Z %474 = arith.addi %97, %449 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4272360Z %475 = tt.addptr %5, %474 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4272558Z %476 = arith.andi %103, %453 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4272768Z %477 = tt.load %475, %476, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4273107Z %478 = ttg.memdesc_index %454[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4273466Z ttg.local_store %473, %478 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4274246Z %479:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %462, %arg8 = %470, %arg9 = %478, %arg10 = %461, %arg11 = %469, %arg12 = %477) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:52:25.4274918Z %553 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:52:25.4275060Z %554 = arith.muli %553, %c2_i32 : i32 2026-02-21T09:52:25.4275232Z %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4275458Z %556 = arith.addi %555, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4275735Z %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent 
= #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4276018Z %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4276213Z %559 = arith.addi %444, %558 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4276434Z %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4276643Z %561 = tt.load %560 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4276945Z %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4277402Z %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4277687Z %564 = arith.extsi %553 : i32 to i64 2026-02-21T09:52:25.4277858Z %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4278082Z %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4278353Z %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4278605Z %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4278797Z %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4278990Z %570 = arith.addi %569, %449 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4279188Z %571 = tt.addptr %5, %570 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4279396Z %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4279571Z %573 = arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4279733Z %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked> 
2026-02-21T09:52:25.4279934Z %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4280125Z %576 = arith.andi %575, %453 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4280295Z %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4280473Z %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4280634Z %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4280884Z %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4281134Z %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4281376Z %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4281717Z %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4282066Z %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4282365Z %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4282664Z %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4282926Z %587 = tt.broadcast %584 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4283172Z %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4283408Z %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4283637Z %590 = arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4283940Z %591 = ttg.convert_layout %590 : 
tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4284428Z %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4284781Z %593 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.4284908Z %594 = arith.cmpi slt, %593, %c3_i32 : i32 2026-02-21T09:52:25.4285043Z %595 = arith.select %594, %593, %c0_i32 : i32 2026-02-21T09:52:25.4285331Z %596 = ttg.memdesc_index %454[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4285692Z ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4286320Z scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4286845Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T09:52:25.4287125Z %480 = ttg.local_load %479#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4287566Z %481 = arith.extf %480 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4287873Z %482 = arith.shli %479#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4288051Z %483 = arith.shrsi %482, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4288296Z %484 = ttg.convert_layout %483 : 
tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4288545Z %485 = arith.shrsi %479#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4288789Z %486 = ttg.convert_layout %485 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4289126Z %487 = tt.expand_dims %484 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4289474Z %488 = tt.expand_dims %486 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4289769Z %489 = tt.broadcast %487 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4290020Z %490 = arith.select %14, %489, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4290271Z %491 = tt.broadcast %488 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4290521Z %492 = arith.select %16, %491, %490 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4290758Z %493 = tt.reshape %492 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4290996Z %494 = arith.sitofp %493 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4291313Z %495 = ttg.convert_layout %494 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4291786Z %496 = tt.dot %481, %495, %479#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4292292Z %497 = ttg.local_load %479#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, 
kWidth = 2}>> 2026-02-21T09:52:25.4292748Z %498 = arith.extf %497 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4293059Z %499 = arith.shli %479#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4293229Z %500 = arith.shrsi %499, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4293480Z %501 = ttg.convert_layout %500 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4293748Z %502 = arith.shrsi %479#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4293992Z %503 = ttg.convert_layout %502 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4294334Z %504 = tt.expand_dims %501 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4294682Z %505 = tt.expand_dims %503 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4294976Z %506 = tt.broadcast %504 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4295228Z %507 = arith.select %14, %506, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4295477Z %508 = tt.broadcast %505 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4295725Z %509 = arith.select %16, %508, %507 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4295967Z %510 = tt.reshape %509 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4296196Z %511 = arith.sitofp %510 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4296517Z %512 = ttg.convert_layout %511 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, 
kWidth = 2}>> 2026-02-21T09:52:25.4296986Z %513 = tt.dot %498, %512, %496, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4297487Z %514 = ttg.local_load %479#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4297927Z %515 = arith.extf %514 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4298229Z %516 = arith.shli %479#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4298396Z %517 = arith.shrsi %516, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4298644Z %518 = ttg.convert_layout %517 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4298894Z %519 = arith.shrsi %479#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4299147Z %520 = ttg.convert_layout %519 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4299499Z %521 = tt.expand_dims %518 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4299848Z %522 = tt.expand_dims %520 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4300144Z %523 = tt.broadcast %521 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4300391Z %524 = arith.select %14, %523, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4300643Z %525 = tt.broadcast %522 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4300901Z %526 = arith.select %16, 
%525, %524 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4301145Z %527 = tt.reshape %526 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4301375Z %528 = arith.sitofp %527 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4301677Z %529 = ttg.convert_layout %528 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4302159Z %530 = tt.dot %515, %529, %513, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4302553Z ttg.local_dealloc %454 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4302775Z %531 = arith.truncf %530 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.4302959Z %532 = arith.extsi %438 : i32 to i64 2026-02-21T09:52:25.4303126Z %533 = tt.splat %532 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4303348Z %534 = arith.addi %533, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4303621Z %535 = tt.expand_dims %534 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4303869Z %536 = arith.muli %535, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4304057Z %537 = tt.broadcast %536 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4304268Z %538 = tt.splat %445 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4304505Z %539 = arith.addi %538, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4304774Z %540 = tt.expand_dims %539 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 
2026-02-21T09:52:25.4305042Z %541 = tt.broadcast %540 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4305231Z %542 = arith.addi %537, %541 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4305423Z %543 = tt.addptr %17, %542 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4305634Z %544 = arith.cmpi sge, %535, %cst_18 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4305807Z %545 = arith.cmpi slt, %535, %cst_19 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4305972Z %546 = arith.andi %544, %545 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.4306154Z %547 = tt.broadcast %546 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4306345Z %548 = arith.cmpi sge, %540, %cst_15 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4306518Z %549 = arith.cmpi slt, %540, %cst_16 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4306676Z %550 = arith.andi %548, %549 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.4306855Z %551 = tt.broadcast %550 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4307037Z %552 = arith.andi %547, %551 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4307206Z tt.store %543, %531, %552 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.4307377Z } {tt.num_stages = 1 : i32} 2026-02-21T09:52:25.4307513Z scf.for %arg3 = %26 to %c4096_i32 step %c9728_i32 : i32 { 2026-02-21T09:52:25.4307666Z %27 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.4307791Z %28 = arith.muli %27, %c8_i32 : i32 2026-02-21T09:52:25.4307914Z %29 = arith.subi %c128_i32, %28 : i32 2026-02-21T09:52:25.4308032Z %30 = arith.minsi %29, %c8_i32 : i32 2026-02-21T09:52:25.4308161Z %31 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.4308283Z %32 = arith.remsi %31, %30 : i32 2026-02-21T09:52:25.4308401Z %33 = arith.addi %28, %32 : i32 2026-02-21T09:52:25.4308536Z %34 = arith.divsi %31, %30 : i32 2026-02-21T09:52:25.4308648Z %35 = arith.muli %33, %c128_i32 : i32 2026-02-21T09:52:25.4308820Z %36 = 
tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4309045Z %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.4309222Z %38 = arith.muli %34, %c256_i32 : i32 2026-02-21T09:52:25.4309460Z %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4309718Z %40 = arith.muli %39, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.4309918Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4310097Z %42 = arith.extsi %38 : i32 to i64 2026-02-21T09:52:25.4310268Z %43 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4310486Z %44 = arith.addi %43, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:25.4310763Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4311045Z %46 = tt.broadcast %45 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4311245Z %47 = arith.cmpi sge, %45, %cst_14 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4311423Z %48 = arith.cmpi slt, %45, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:52:25.4311586Z %49 = arith.andi %47, %48 : tensor<1x256xi1, #blocked> 2026-02-21T09:52:25.4311769Z %50 = tt.broadcast %49 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4312004Z %51 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4312275Z %52 = tt.expand_dims %3 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4312552Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4312741Z %54 = 
arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4312942Z %55 = tt.addptr %4, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4313146Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4313384Z %57 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4313629Z %58 = arith.muli %57, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4313816Z %59 = tt.broadcast %58 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4314007Z %60 = arith.addi %59, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4314197Z %61 = tt.addptr %5, %60 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4314403Z %62 = arith.cmpi sge, %57, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4314578Z %63 = arith.cmpi slt, %57, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4314737Z %64 = arith.andi %62, %63 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4314930Z %65 = tt.broadcast %64 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4315112Z %66 = arith.andi %65, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4315323Z %67 = tt.load %61, %66, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4315666Z %68 = ttg.memdesc_index %51[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4316029Z ttg.local_store %56, %68 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4316325Z %69 = arith.addi %3, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4316600Z %70 = tt.expand_dims %69 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 
2026-02-21T09:52:25.4316881Z %71 = tt.broadcast %70 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4317078Z %72 = arith.addi %41, %71 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4317292Z %73 = tt.addptr %4, %72 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4317499Z %74 = tt.load %73 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4317689Z %75 = arith.addi %7, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4317966Z %76 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4318214Z %77 = arith.muli %76, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4318398Z %78 = tt.broadcast %77 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4318589Z %79 = arith.addi %78, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4318779Z %80 = tt.addptr %5, %79 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4318983Z %81 = arith.cmpi sge, %76, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4319153Z %82 = arith.cmpi slt, %76, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4319317Z %83 = arith.andi %81, %82 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4319498Z %84 = tt.broadcast %83 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4319697Z %85 = arith.andi %84, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4319906Z %86 = tt.load %80, %85, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4320241Z %87 = ttg.memdesc_index %51[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4320607Z ttg.local_store %74, %87 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 
2026-02-21T09:52:25.4320885Z %88 = arith.addi %3, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4321160Z %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4321436Z %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4321625Z %91 = arith.addi %41, %90 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4321829Z %92 = tt.addptr %4, %91 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4322038Z %93 = tt.load %92 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4322227Z %94 = arith.addi %7, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4322502Z %95 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4322796Z %96 = arith.muli %95, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4322986Z %97 = tt.broadcast %96 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4323180Z %98 = arith.addi %97, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4323368Z %99 = tt.addptr %5, %98 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4323575Z %100 = arith.cmpi sge, %95, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4323749Z %101 = arith.cmpi slt, %95, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4323913Z %102 = arith.andi %100, %101 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4324116Z %103 = tt.broadcast %102 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4324305Z %104 = arith.andi %103, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4324524Z %105 = tt.load %99, %104, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4324884Z %106 = ttg.memdesc_index 
%51[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4325251Z ttg.local_store %93, %106 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4326028Z %107:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %68, %arg8 = %87, %arg9 = %106, %arg10 = %67, %arg11 = %86, %arg12 = %105) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:52:25.4326695Z %181 = arith.addi %arg4, %c6_i32 : i32 2026-02-21T09:52:25.4326828Z %182 = arith.muli %181, %c2_i32 : i32 2026-02-21T09:52:25.4327007Z %183 = tt.splat %182 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4327233Z %184 = arith.addi %183, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.4327511Z %185 = tt.expand_dims %184 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.4327806Z %186 = tt.broadcast %185 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4328005Z %187 = arith.addi %41, %186 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4328212Z %188 = tt.addptr %4, %187 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.4328418Z %189 = tt.load %188 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.4328722Z %190 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4329162Z %191 = arith.extf %190 : 
tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4329452Z %192 = arith.extsi %181 : i32 to i64 2026-02-21T09:52:25.4329623Z %193 = tt.splat %192 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4329842Z %194 = arith.addi %193, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.4330117Z %195 = tt.expand_dims %194 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4330365Z %196 = arith.muli %195, %cst_11 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4330560Z %197 = tt.broadcast %196 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4330771Z %198 = arith.addi %197, %46 : tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4330966Z %199 = tt.addptr %5, %198 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:52:25.4331176Z %200 = arith.cmpi sge, %195, %cst_12 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4331349Z %201 = arith.cmpi slt, %195, %cst_13 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:25.4331517Z %202 = arith.andi %200, %201 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:25.4331702Z %203 = tt.broadcast %202 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4331892Z %204 = arith.andi %203, %50 : tensor<2x256xi1, #blocked> 2026-02-21T09:52:25.4332080Z %205 = tt.load %199, %204, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:52:25.4332253Z %206 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4332419Z %207 = arith.shrsi %206, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4332667Z %208 = ttg.convert_layout %207 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4332941Z %209 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked> 
2026-02-21T09:52:25.4333187Z %210 = ttg.convert_layout %209 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4333527Z %211 = tt.expand_dims %208 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4333878Z %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4334173Z %213 = tt.broadcast %211 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4334423Z %214 = arith.select %14, %213, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4334674Z %215 = tt.broadcast %212 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4334918Z %216 = arith.select %16, %215, %214 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4335157Z %217 = tt.reshape %216 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4335384Z %218 = arith.sitofp %217 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4335704Z %219 = ttg.convert_layout %218 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4336179Z %220 = tt.dot %191, %219, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4336528Z %221 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.4336658Z %222 = arith.cmpi slt, %221, %c3_i32 : i32 2026-02-21T09:52:25.4336790Z %223 = arith.select %222, %221, %c0_i32 : i32 2026-02-21T09:52:25.4337059Z %224 = ttg.memdesc_index %51[%223] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, 
#shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4337427Z ttg.local_store %189, %224 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> 2026-02-21T09:52:25.4338050Z scf.yield %220, %223, %arg8, %arg9, %224, %arg11, %arg12, %205 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4338577Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T09:52:25.4338874Z %108 = ttg.local_load %107#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4339309Z %109 = arith.extf %108 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4339616Z %110 = arith.shli %107#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4339781Z %111 = arith.shrsi %110, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4340025Z %112 = ttg.convert_layout %111 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4340291Z %113 = arith.shrsi %107#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4340532Z %114 = ttg.convert_layout %113 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4340873Z %115 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4341237Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4341527Z %117 = tt.broadcast %115 : 
tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4341776Z %118 = arith.select %14, %117, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4342020Z %119 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4342265Z %120 = arith.select %16, %119, %118 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4342506Z %121 = tt.reshape %120 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4342734Z %122 = arith.sitofp %121 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4343034Z %123 = ttg.convert_layout %122 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4343503Z %124 = tt.dot %109, %123, %107#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4344021Z %125 = ttg.local_load %107#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4344456Z %126 = arith.extf %125 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4344757Z %127 = arith.shli %107#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4344923Z %128 = arith.shrsi %127, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4345168Z %129 = ttg.convert_layout %128 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4345415Z %130 = arith.shrsi %107#6, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4345660Z %131 = ttg.convert_layout %130 : tensor<2x256xi8, #blocked> 
-> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4345998Z %132 = tt.expand_dims %129 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4346347Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4346641Z %134 = tt.broadcast %132 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4346887Z %135 = arith.select %14, %134, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4347147Z %136 = tt.broadcast %133 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4347386Z %137 = arith.select %16, %136, %135 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4347623Z %138 = tt.reshape %137 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4347852Z %139 = arith.sitofp %138 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4348149Z %140 = ttg.convert_layout %139 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4348632Z %141 = tt.dot %126, %140, %124, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.4349132Z %142 = ttg.local_load %107#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4349582Z %143 = arith.extf %142 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4349884Z %144 = arith.shli %107#7, 
%cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4350047Z %145 = arith.shrsi %144, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4350292Z %146 = ttg.convert_layout %145 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4350544Z %147 = arith.shrsi %107#7, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:52:25.4350787Z %148 = ttg.convert_layout %147 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.4351125Z %149 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4351468Z %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:52:25.4351758Z %151 = tt.broadcast %149 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4352002Z %152 = arith.select %14, %151, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4352261Z %153 = tt.broadcast %150 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4352504Z %154 = arith.select %16, %153, %152 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:52:25.4352741Z %155 = tt.reshape %154 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.4352968Z %156 = arith.sitofp %155 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.4353267Z %157 = ttg.convert_layout %156 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.4353729Z %158 = tt.dot %143, %157, %141, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, 
#mma> 2026-02-21T09:52:25.4354117Z ttg.local_dealloc %51 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.4354337Z %159 = arith.truncf %158 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.4354510Z %160 = arith.extsi %35 : i32 to i64 2026-02-21T09:52:25.4354680Z %161 = tt.splat %160 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4354895Z %162 = arith.addi %161, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.4355180Z %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4355427Z %164 = arith.muli %163, %cst_17 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4355609Z %165 = tt.broadcast %164 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4355821Z %166 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4356032Z %167 = arith.addi %166, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.4356300Z %168 = tt.expand_dims %167 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4356584Z %169 = tt.broadcast %168 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4356768Z %170 = arith.addi %165, %169 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4356966Z %171 = tt.addptr %17, %170 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.4357168Z %172 = arith.cmpi sge, %163, %cst_18 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4357351Z %173 = arith.cmpi slt, %163, %cst_19 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.4357512Z %174 = arith.andi %172, %173 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.4357691Z %175 = tt.broadcast %174 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4357880Z %176 = arith.cmpi sge, %168, %cst_15 : 
tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4358045Z %177 = arith.cmpi slt, %168, %cst_16 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.4358206Z %178 = arith.andi %176, %177 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.4358379Z %179 = tt.broadcast %178 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4358561Z %180 = arith.andi %175, %179 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.4358723Z tt.store %171, %159, %180 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.4358873Z } {tt.num_stages = 1 : i32} 2026-02-21T09:52:25.4358979Z tt.return 2026-02-21T09:52:25.4359058Z } 2026-02-21T09:52:25.4359138Z } 2026-02-21T09:52:25.4359183Z 2026-02-21T09:52:25.4359215Z {-# 2026-02-21T09:52:25.4359299Z external_resources: { 2026-02-21T09:52:25.4359397Z mlir_reproducer: { 2026-02-21T09:52:25.4360416Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:52:25.4360462Z disable_threading: false, 2026-02-21T09:52:25.4360502Z verify_each: true 2026-02-21T09:52:25.4360534Z } 2026-02-21T09:52:25.4360565Z } 2026-02-21T09:52:25.4360598Z #-} 2026-02-21T09:52:25.4360839Z /tmp/torchinductor_root/5y/c5yo5boeljmnkfo5b4u2etg2hep4hcalbcy3we66musd3ytrm6yd.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:52:25.4361264Z /tmp/torchinductor_root/5y/c5yo5boeljmnkfo5b4u2etg2hep4hcalbcy3we66musd3ytrm6yd.py:14:0: note: Pipeline failed while executing 
[`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:52:25.4361380Z [476s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:52:25.4362013Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:52:25.4362082Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:52:25.4362166Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:52:25.7094585Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:52:25.7113177Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T09:52:25.7113479Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:52:25.7113686Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:52:25.7114011Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T09:52:25.7114158Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:52:25.7114294Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:52:25.7114427Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:52:25.7114486Z #smem = #ttg.shared_memory 2026-02-21T09:52:25.7114701Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:52:25.7115061Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:52:25.7115159Z %cst = arith.constant dense<8192> : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7115252Z %cst_0 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7115348Z %cst_1 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7115422Z %cst_2 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7115621Z %cst_3 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7115718Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:52:25.7115810Z %cst_5 = 
arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:52:25.7115912Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7115975Z %c13823_i32 = arith.constant 13823 : i32 2026-02-21T09:52:25.7116033Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:52:25.7116089Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:52:25.7116239Z %cst_7 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7116380Z %cst_8 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7116513Z %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7116612Z %cst_10 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7116707Z %cst_11 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7116765Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:25.7116821Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:52:25.7116876Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:52:25.7116941Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:52:25.7116997Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:52:25.7117114Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:52:25.7117174Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T09:52:25.7117235Z %c19456_i32 = arith.constant 19456 : i32 2026-02-21T09:52:25.7117295Z %c29184_i32 = arith.constant 29184 : i32 2026-02-21T09:52:25.7117384Z %cst_12 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7117438Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:52:25.7117497Z %c9728_i32 = arith.constant 9728 : i32 2026-02-21T09:52:25.7117636Z %cst_13 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7117697Z %0 = tt.get_program_id x : i32 2026-02-21T09:52:25.7117892Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} 
: tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7118040Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7118196Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7118392Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7118542Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7118686Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7118804Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7118907Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7119112Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:52:25.7119412Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:52:25.7119609Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:52:25.7119702Z %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:52:25.7119824Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:52:25.7119930Z %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:52:25.7120046Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:52:25.7120157Z %16 = tt.splat %arg2 : !tt.ptr -> 
tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.7120354Z %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7120576Z %18 = arith.extsi %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7120631Z %19 = arith.subi %c13823_i32, %0 : i32 2026-02-21T09:52:25.7120684Z %20 = arith.divui %19, %c9728_i32 : i32 2026-02-21T09:52:25.7120732Z %21 = arith.remsi %20, %c4_i32 : i32 2026-02-21T09:52:25.7120780Z %22 = arith.subi %20, %21 : i32 2026-02-21T09:52:25.7120831Z %23 = arith.muli %22, %c9728_i32 : i32 2026-02-21T09:52:25.7120879Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:52:25.7120945Z scf.for %arg3 = %0 to %24 step %c38912_i32 : i32 { 2026-02-21T09:52:25.7121003Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.7121053Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:52:25.7121103Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:52:25.7121151Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:52:25.7121209Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.7121275Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:52:25.7121321Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:52:25.7121371Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:52:25.7121421Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:52:25.7121537Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7121653Z %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7121702Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:52:25.7121808Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7121930Z %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7122109Z %39 = tt.expand_dims 
%35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7122183Z %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7122296Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7122494Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:52:25.7122675Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7122784Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7122950Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7123057Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7123134Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7123257Z %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7123336Z %49 = tt.load %48 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7123560Z %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7123728Z ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7123866Z %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7124034Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7124149Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 
2026-02-21T09:52:25.7124220Z %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7124338Z %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7124416Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7124630Z %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7124792Z ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7125210Z %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:52:25.7125335Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7125460Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7125518Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:52:25.7125573Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:52:25.7125689Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7125799Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7125971Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7126110Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7126183Z %417 = arith.addi %41, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7126312Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7126388Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7126660Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7126898Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7127075Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7127171Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7127284Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7127353Z %425 = arith.addi %424, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7127479Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7127571Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7127752Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7127868Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7128010Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7128128Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7128314Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7128487Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7128614Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7128738Z %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7128849Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7128978Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7129084Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7129197Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7129349Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7129587Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7129911Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7129969Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.7130026Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:52:25.7130092Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:52:25.7130310Z %446 = ttg.memdesc_index %44[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7130498Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7130720Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, 
mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7130785Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:52:25.7130878Z %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7131075Z %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7131273Z %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7131415Z %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7131477Z %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7131569Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7131629Z %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7131730Z %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7131791Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7131949Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7132045Z %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7132147Z %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7132243Z %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7132390Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 
2026-02-21T09:52:25.7132537Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7132631Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7132733Z %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7132827Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7132923Z %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7133011Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7133106Z %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7141898Z %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7142093Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7142360Z %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7142455Z %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7142691Z %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7142884Z %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7143048Z %86 = tt.expand_dims %83 {axis = 1 
: i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7143115Z %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7143205Z %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7143267Z %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7143369Z %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7143430Z %91 = tt.load %90 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7143574Z %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7143671Z %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7143770Z %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7143866Z %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7144015Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7144178Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7144274Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7144384Z %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7144477Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7144578Z %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7144673Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> 
tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7144766Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7144885Z %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7145058Z %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7145319Z %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7145407Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7145517Z %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.7145561Z %108 = arith.extsi %33 : i32 to i64 2026-02-21T09:52:25.7145603Z %109 = arith.extsi %36 : i32 to i64 2026-02-21T09:52:25.7145694Z %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7145780Z %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7145925Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7146006Z %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7146091Z %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7146175Z %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7146261Z %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7146420Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> 
tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7146503Z %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7146562Z %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7146662Z %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7146731Z %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7146798Z %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7146861Z %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.7146943Z %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7147005Z %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7147070Z %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7147128Z %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.7147208Z %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7147271Z %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7147359Z tt.store %120, %107, %129 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.7147407Z %130 = arith.addi %arg3, %c9728_i32 : i32 2026-02-21T09:52:25.7147454Z %131 = arith.divsi %130, %c256_i32 : i32 2026-02-21T09:52:25.7147501Z %132 = arith.muli %131, %c8_i32 : i32 2026-02-21T09:52:25.7147546Z %133 = arith.subi %c128_i32, %132 : i32 2026-02-21T09:52:25.7147590Z %134 = arith.minsi %133, %c8_i32 : i32 2026-02-21T09:52:25.7147636Z %135 = arith.remsi %130, %c256_i32 : i32 2026-02-21T09:52:25.7147678Z %136 = arith.remsi %135, %134 : i32 2026-02-21T09:52:25.7147718Z %137 = arith.addi %132, %136 : i32 2026-02-21T09:52:25.7147760Z %138 = arith.divsi %135, %134 : i32 2026-02-21T09:52:25.7147804Z %139 = arith.muli %137, %c128_i32 : i32 2026-02-21T09:52:25.7147899Z %140 = tt.splat %139 : i32 -> 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7147993Z %141 = arith.addi %140, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7148048Z %142 = arith.muli %138, %c256_i32 : i32 2026-02-21T09:52:25.7148140Z %143 = tt.splat %142 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7148233Z %144 = arith.addi %143, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7148386Z %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7148467Z %146 = arith.muli %145, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7148562Z %147 = tt.broadcast %146 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7148713Z %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:52:25.7148807Z %149 = tt.broadcast %148 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7148897Z %150 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7148963Z %151 = arith.addi %147, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7149085Z %152 = tt.addptr %7, %151 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7149150Z %153 = tt.load %152 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7149340Z %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7149499Z ttg.local_store %153, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7149560Z %155 = arith.addi %147, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7149664Z %156 = 
tt.addptr %7, %155 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7149729Z %157 = tt.load %156 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7149910Z %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7150055Z ttg.local_store %157, %158 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7150403Z %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:52:25.7150502Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7150598Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7150658Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:52:25.7150702Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:52:25.7150797Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7150886Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7151033Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7151129Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7151193Z %417 = arith.addi %147, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7151297Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7151364Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 
2026-02-21T09:52:25.7151566Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7151767Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7151914Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7151992Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7152085Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7152149Z %425 = arith.addi %424, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7152248Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7152311Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7152461Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7152585Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7152684Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7152786Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7152951Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7153099Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7153200Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> 
tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7153308Z %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7153406Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7153508Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7153601Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7153696Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7153818Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7153992Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7154283Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7154334Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.7154384Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:52:25.7154436Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:52:25.7154619Z %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7154765Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7154985Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7155033Z } 
{tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:52:25.7155234Z %160 = ttg.local_load %159#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7155432Z %161 = arith.extf %160 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7155510Z %162 = arith.addi %64, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7155613Z %163 = tt.addptr %8, %162 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7155675Z %164 = tt.load %163 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7155819Z %165 = ttg.convert_layout %164 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7155922Z %166 = arith.shli %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7156037Z %167 = arith.shrsi %166, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7156134Z %168 = arith.shrsi %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7156287Z %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7156435Z %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7156545Z %171 = tt.broadcast %169 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7156653Z %172 = arith.select %13, %171, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7156748Z %173 = tt.broadcast %170 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7156848Z %174 = arith.select %15, %173, 
%172 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7156944Z %175 = tt.reshape %174 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7157037Z %176 = arith.sitofp %175 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7157157Z %177 = ttg.local_alloc %176 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7157328Z %178 = ttg.local_load %177 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7157593Z %179 = tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7157803Z %180 = ttg.local_load %159#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7158002Z %181 = arith.extf %180 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7158063Z %182 = arith.addi %88, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7158165Z %183 = tt.addptr %8, %182 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7158229Z %184 = tt.load %183 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7158373Z %185 = ttg.convert_layout %184 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7158471Z %186 = arith.shli %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7158573Z %187 = arith.shrsi %186, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7158671Z %188 = arith.shrsi %185, 
%cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7158823Z %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7158995Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7159091Z %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7159196Z %192 = arith.select %13, %191, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7159291Z %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7159427Z %194 = arith.select %15, %193, %192 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7159517Z %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7159629Z %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7159745Z %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7159913Z %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7160190Z %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7160278Z ttg.local_dealloc %150 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7160368Z %200 = arith.truncf %199 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.7160418Z %201 = arith.extsi %139 : i32 to i64 2026-02-21T09:52:25.7160460Z 
%202 = arith.extsi %142 : i32 to i64 2026-02-21T09:52:25.7160547Z %203 = tt.splat %201 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7160635Z %204 = arith.addi %203, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7160776Z %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7160837Z %206 = arith.muli %205, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7160924Z %207 = tt.broadcast %206 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7161008Z %208 = tt.splat %202 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7161105Z %209 = arith.addi %208, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7161245Z %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7161329Z %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7161387Z %212 = arith.addi %207, %211 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7161488Z %213 = tt.addptr %16, %212 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7161554Z %214 = arith.cmpi sge, %205, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7161620Z %215 = arith.cmpi slt, %205, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7161681Z %216 = arith.andi %214, %215 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.7161763Z %217 = tt.broadcast %216 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7161826Z %218 = arith.cmpi sge, %210, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7161889Z %219 = arith.cmpi slt, %210, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7161949Z %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.7162030Z %221 = tt.broadcast %220 : 
tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7162088Z %222 = arith.andi %217, %221 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7162172Z tt.store %213, %200, %222 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.7162220Z %223 = arith.addi %arg3, %c19456_i32 : i32 2026-02-21T09:52:25.7162265Z %224 = arith.divsi %223, %c256_i32 : i32 2026-02-21T09:52:25.7162311Z %225 = arith.muli %224, %c8_i32 : i32 2026-02-21T09:52:25.7162355Z %226 = arith.subi %c128_i32, %225 : i32 2026-02-21T09:52:25.7162399Z %227 = arith.minsi %226, %c8_i32 : i32 2026-02-21T09:52:25.7162442Z %228 = arith.remsi %223, %c256_i32 : i32 2026-02-21T09:52:25.7162487Z %229 = arith.remsi %228, %227 : i32 2026-02-21T09:52:25.7162529Z %230 = arith.addi %225, %229 : i32 2026-02-21T09:52:25.7162616Z %231 = arith.divsi %228, %227 : i32 2026-02-21T09:52:25.7162685Z %232 = arith.muli %230, %c128_i32 : i32 2026-02-21T09:52:25.7162779Z %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7162872Z %234 = arith.addi %233, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7162919Z %235 = arith.muli %231, %c256_i32 : i32 2026-02-21T09:52:25.7163010Z %236 = tt.splat %235 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7163123Z %237 = arith.addi %236, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7163276Z %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7163344Z %239 = arith.muli %238, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7163439Z %240 = tt.broadcast %239 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7163591Z %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 
2026-02-21T09:52:25.7163684Z %242 = tt.broadcast %241 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7163773Z %243 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7163837Z %244 = arith.addi %240, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7163941Z %245 = tt.addptr %7, %244 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7164006Z %246 = tt.load %245 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7164210Z %247 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7164356Z ttg.local_store %246, %247 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7164417Z %248 = arith.addi %240, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7164518Z %249 = tt.addptr %7, %248 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7164584Z %250 = tt.load %249 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7164766Z %251 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7164906Z ttg.local_store %250, %251 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7165254Z %252:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %247, %arg8 = %251) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:52:25.7165354Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7165448Z %410 = arith.addi %409, %5 : tensor<2xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7165511Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:52:25.7165554Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:52:25.7165646Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7165738Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7165882Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7165976Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7166057Z %417 = arith.addi %240, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7166160Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7166224Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7166427Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7166643Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7166787Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7166854Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7166948Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7167010Z %425 = arith.addi %424, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7167112Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7167174Z %427 = tt.load %426 : 
tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7167322Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7167424Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7167524Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7167635Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7167788Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7167937Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7168035Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7168144Z %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7168239Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7168338Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7168433Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7168526Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7168646Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7168819Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7169100Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : 
tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7169148Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.7169200Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:52:25.7169250Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:52:25.7169432Z %446 = ttg.memdesc_index %243[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7169595Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7169814Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7169861Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:52:25.7170075Z %253 = ttg.local_load %252#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7170270Z %254 = arith.extf %253 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7170333Z %255 = arith.addi %64, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7170437Z %256 = tt.addptr %8, %255 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7170498Z %257 = tt.load %256 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7170641Z %258 = ttg.convert_layout %257 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7170745Z %259 = arith.shli %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7170844Z %260 = 
arith.shrsi %259, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7170942Z %261 = arith.shrsi %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7171109Z %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7171257Z %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7171354Z %264 = tt.broadcast %262 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7171462Z %265 = arith.select %13, %264, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7171556Z %266 = tt.broadcast %263 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7171654Z %267 = arith.select %15, %266, %265 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7171748Z %268 = tt.reshape %267 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7171840Z %269 = arith.sitofp %268 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7171959Z %270 = ttg.local_alloc %269 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7172130Z %271 = ttg.local_load %270 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7172394Z %272 = tt.dot %254, %271, %252#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7172604Z %273 = ttg.local_load %252#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent 
= #mma, kWidth = 2}>> 2026-02-21T09:52:25.7172803Z %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7172865Z %275 = arith.addi %88, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7172967Z %276 = tt.addptr %8, %275 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7173045Z %277 = tt.load %276 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7173190Z %278 = ttg.convert_layout %277 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7173288Z %279 = arith.shli %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7173390Z %280 = arith.shrsi %279, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7173511Z %281 = arith.shrsi %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7173659Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7173808Z %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7173903Z %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7174008Z %285 = arith.select %13, %284, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7174103Z %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7174202Z %287 = arith.select %15, %286, %285 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7174293Z %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7174392Z 
%289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7174510Z %290 = ttg.local_alloc %289 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7174693Z %291 = ttg.local_load %290 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7174957Z %292 = tt.dot %274, %291, %272, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7175045Z ttg.local_dealloc %243 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7175134Z %293 = arith.truncf %292 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.7175181Z %294 = arith.extsi %232 : i32 to i64 2026-02-21T09:52:25.7175224Z %295 = arith.extsi %235 : i32 to i64 2026-02-21T09:52:25.7175310Z %296 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7175398Z %297 = arith.addi %296, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7175544Z %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7175606Z %299 = arith.muli %298, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7175689Z %300 = tt.broadcast %299 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7175775Z %301 = tt.splat %295 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7175874Z %302 = arith.addi %301, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7176012Z %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7176097Z %304 = 
tt.broadcast %303 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7176156Z %305 = arith.addi %300, %304 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7176255Z %306 = tt.addptr %16, %305 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7176323Z %307 = arith.cmpi sge, %298, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7176401Z %308 = arith.cmpi slt, %298, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7176457Z %309 = arith.andi %307, %308 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.7176538Z %310 = tt.broadcast %309 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7176605Z %311 = arith.cmpi sge, %303, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7176669Z %312 = arith.cmpi slt, %303, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7176739Z %313 = arith.andi %311, %312 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.7176823Z %314 = tt.broadcast %313 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7176881Z %315 = arith.andi %310, %314 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7176949Z tt.store %306, %293, %315 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.7177000Z %316 = arith.addi %arg3, %c29184_i32 : i32 2026-02-21T09:52:25.7177045Z %317 = arith.divsi %316, %c256_i32 : i32 2026-02-21T09:52:25.7177088Z %318 = arith.muli %317, %c8_i32 : i32 2026-02-21T09:52:25.7177134Z %319 = arith.subi %c128_i32, %318 : i32 2026-02-21T09:52:25.7177177Z %320 = arith.minsi %319, %c8_i32 : i32 2026-02-21T09:52:25.7177220Z %321 = arith.remsi %316, %c256_i32 : i32 2026-02-21T09:52:25.7177261Z %322 = arith.remsi %321, %320 : i32 2026-02-21T09:52:25.7177304Z %323 = arith.addi %318, %322 : i32 2026-02-21T09:52:25.7177344Z %324 = arith.divsi %321, %320 : i32 2026-02-21T09:52:25.7177387Z %325 = arith.muli %323, %c128_i32 : i32 2026-02-21T09:52:25.7177484Z %326 = tt.splat %325 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:52:25.7177576Z %327 = arith.addi %326, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7177632Z %328 = arith.muli %324, %c256_i32 : i32 2026-02-21T09:52:25.7177727Z %329 = tt.splat %328 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7177819Z %330 = arith.addi %329, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7177969Z %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7178039Z %332 = arith.muli %331, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7178133Z %333 = tt.broadcast %332 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7178280Z %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:52:25.7178375Z %335 = tt.broadcast %334 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7178462Z %336 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7178522Z %337 = arith.addi %333, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7178625Z %338 = tt.addptr %7, %337 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7178693Z %339 = tt.load %338 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7178878Z %340 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7179036Z ttg.local_store %339, %340 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7179098Z %341 = arith.addi %333, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7179199Z %342 = tt.addptr %7, %341 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7179264Z %343 = tt.load %342 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7179452Z %344 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7179615Z ttg.local_store %343, %344 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7179959Z %345:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %340, %arg8 = %344) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:52:25.7180075Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7180166Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7180212Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:52:25.7180258Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:52:25.7180351Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7180442Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7180588Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7180681Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7180744Z %417 = arith.addi %333, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7180852Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7180915Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7181129Z %420 = ttg.local_load %arg7 : 
!ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7181334Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7181477Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7181543Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7181638Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7181700Z %425 = arith.addi %424, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7181800Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7181865Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7182013Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7182113Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7182216Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7182329Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7182483Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7182632Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7182729Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7182836Z %435 = 
arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7182950Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7183050Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7183140Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7183237Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7183367Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7183538Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7183806Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7183855Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.7183903Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:52:25.7183956Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:52:25.7184138Z %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7184284Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7184519Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7184566Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:52:25.7184765Z %346 = 
ttg.local_load %345#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7184965Z %347 = arith.extf %346 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7185027Z %348 = arith.addi %64, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7185130Z %349 = tt.addptr %8, %348 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7185193Z %350 = tt.load %349 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7185337Z %351 = ttg.convert_layout %350 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7185439Z %352 = arith.shli %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7185540Z %353 = arith.shrsi %352, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7185639Z %354 = arith.shrsi %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7185789Z %355 = tt.expand_dims %353 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7185951Z %356 = tt.expand_dims %354 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7186048Z %357 = tt.broadcast %355 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7186152Z %358 = arith.select %13, %357, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7186249Z %359 = tt.broadcast %356 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7186348Z %360 = arith.select %15, %359, %358 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 
2026-02-21T09:52:25.7186457Z %361 = tt.reshape %360 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7186551Z %362 = arith.sitofp %361 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7186670Z %363 = ttg.local_alloc %362 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7186853Z %364 = ttg.local_load %363 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7187121Z %365 = tt.dot %347, %364, %345#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7187317Z %366 = ttg.local_load %345#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7187513Z %367 = arith.extf %366 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7187577Z %368 = arith.addi %88, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7187679Z %369 = tt.addptr %8, %368 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7187739Z %370 = tt.load %369 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7187887Z %371 = ttg.convert_layout %370 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7187985Z %372 = arith.shli %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7188101Z %373 = arith.shrsi %372, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7188201Z %374 = arith.shrsi %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:52:25.7188351Z %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7188496Z %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7188596Z %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7188701Z %378 = arith.select %13, %377, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7188795Z %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7188898Z %380 = arith.select %15, %379, %378 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7188988Z %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7189081Z %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7189200Z %383 = ttg.local_alloc %382 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7189382Z %384 = ttg.local_load %383 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7189644Z %385 = tt.dot %367, %384, %365, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7189733Z ttg.local_dealloc %336 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7189824Z %386 = arith.truncf %385 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.7189867Z %387 = arith.extsi %325 : i32 to i64 2026-02-21T09:52:25.7189928Z %388 = arith.extsi %328 : i32 to i64 
2026-02-21T09:52:25.7190016Z %389 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7190103Z %390 = arith.addi %389, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7190244Z %391 = tt.expand_dims %390 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7190320Z %392 = arith.muli %391, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7190404Z %393 = tt.broadcast %392 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7190488Z %394 = tt.splat %388 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7190576Z %395 = arith.addi %394, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7190713Z %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7190796Z %397 = tt.broadcast %396 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7190859Z %398 = arith.addi %393, %397 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7190956Z %399 = tt.addptr %16, %398 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7191022Z %400 = arith.cmpi sge, %391, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7191090Z %401 = arith.cmpi slt, %391, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7191147Z %402 = arith.andi %400, %401 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.7191229Z %403 = tt.broadcast %402 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7191308Z %404 = arith.cmpi sge, %396, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7191372Z %405 = arith.cmpi slt, %396, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7191429Z %406 = arith.andi %404, %405 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.7191512Z %407 = tt.broadcast %406 : tensor<1x256xi1, #mma> -> 
tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7191570Z %408 = arith.andi %403, %407 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7191638Z tt.store %399, %386, %408 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.7191682Z } {tt.num_stages = 1 : i32} 2026-02-21T09:52:25.7191747Z scf.for %arg3 = %24 to %c4096_i32 step %c9728_i32 : i32 { 2026-02-21T09:52:25.7191794Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.7191836Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:52:25.7191881Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:52:25.7191924Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:52:25.7191968Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:52:25.7192016Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:52:25.7192056Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:52:25.7192099Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:52:25.7192141Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:52:25.7192235Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7192326Z %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:25.7192386Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:52:25.7192480Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7192568Z %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:52:25.7192715Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7192784Z %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:25.7192875Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7193034Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 
2026-02-21T09:52:25.7193128Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7193214Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7193367Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7193459Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7193517Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7193619Z %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7193681Z %49 = tt.load %48 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7193868Z %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7194008Z ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7194103Z %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7194243Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7194330Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7194389Z %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7194506Z %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7194567Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7194747Z %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7194886Z 
ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7195228Z %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:52:25.7195329Z %130 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7195422Z %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7195472Z %132 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:52:25.7195517Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:52:25.7195612Z %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7195702Z %135 = arith.addi %134, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:25.7195860Z %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:25.7195955Z %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7196017Z %138 = arith.addi %41, %137 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7196123Z %139 = tt.addptr %7, %138 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:25.7196190Z %140 = tt.load %139 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:25.7196405Z %141 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7196602Z %142 = arith.extf %141 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:52:25.7196748Z %143 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7196828Z %144 = arith.muli %143, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7196920Z %145 = tt.broadcast %144 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7196984Z %146 = arith.addi %145, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7197086Z %147 = tt.addptr %8, %146 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7197147Z %148 = tt.load %147 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7197296Z %149 = ttg.convert_layout %148 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7197395Z %150 = arith.shli %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7197495Z %151 = arith.shrsi %150, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7197598Z %152 = arith.shrsi %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7197748Z %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7197912Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7198012Z %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7198120Z %156 = arith.select %13, %155, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7198216Z %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7198321Z %158 = arith.select %15, %157, %156 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 
2026-02-21T09:52:25.7198412Z %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7198506Z %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7198632Z %161 = ttg.local_alloc %160 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7198804Z %162 = ttg.local_load %161 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7199071Z %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7199133Z %164 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:25.7199182Z %165 = arith.cmpi slt, %164, %c2_i32 : i32 2026-02-21T09:52:25.7199233Z %166 = arith.select %165, %164, %c0_i32 : i32 2026-02-21T09:52:25.7199418Z %167 = ttg.memdesc_index %44[%166] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7199562Z ttg.local_store %140, %167 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7199785Z scf.yield %163, %166, %arg8, %167 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:52:25.7199851Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:52:25.7199943Z %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7200137Z %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7200348Z %61 = arith.extf %60 : 
tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7200489Z %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7200551Z %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7200644Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7200704Z %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7200802Z %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7200865Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7201007Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7201104Z %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7201205Z %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7201300Z %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7201469Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7201616Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7201710Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7201812Z %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7201906Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 
2026-02-21T09:52:25.7202001Z %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7202088Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7202180Z %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7202296Z %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7202464Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7202763Z %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7202873Z %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:25.7203065Z %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7203264Z %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7203423Z %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7203484Z %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:52:25.7203573Z %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7203632Z %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:52:25.7203744Z %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 
2026-02-21T09:52:25.7203805Z %91 = tt.load %90 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:52:25.7203946Z %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7204041Z %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7204138Z %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7204236Z %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:25.7204380Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7204523Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:52:25.7204619Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7204719Z %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7204837Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7204939Z %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:52:25.7205030Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:52:25.7205123Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:52:25.7205245Z %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:52:25.7205414Z %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:25.7205671Z %106 = tt.dot %85, %105, 
%82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:52:25.7205760Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:25.7205849Z %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:52:25.7205894Z %108 = arith.extsi %33 : i32 to i64 2026-02-21T09:52:25.7205938Z %109 = arith.extsi %36 : i32 to i64 2026-02-21T09:52:25.7206025Z %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7206123Z %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:25.7206264Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7206324Z %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7206407Z %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7206493Z %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7206578Z %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:25.7206731Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7206817Z %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7206875Z %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7206971Z %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:52:25.7207053Z %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:52:25.7207118Z %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma> 
2026-02-21T09:52:25.7207174Z %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma> 2026-02-21T09:52:25.7207256Z %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7207324Z %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7207390Z %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:52:25.7207446Z %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma> 2026-02-21T09:52:25.7207528Z %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7207587Z %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma> 2026-02-21T09:52:25.7207654Z tt.store %120, %107, %129 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:52:25.7207697Z } {tt.num_stages = 1 : i32} 2026-02-21T09:52:25.7207731Z tt.return 2026-02-21T09:52:25.7207764Z } 2026-02-21T09:52:25.7207797Z } 2026-02-21T09:52:25.7207803Z 2026-02-21T09:52:25.7207833Z {-# 2026-02-21T09:52:25.7207874Z external_resources: { 2026-02-21T09:52:25.7207911Z mlir_reproducer: { 2026-02-21T09:52:25.7208863Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:52:25.7208908Z disable_threading: false, 2026-02-21T09:52:25.7208947Z verify_each: true 2026-02-21T09:52:25.7208979Z } 2026-02-21T09:52:25.7209010Z } 2026-02-21T09:52:25.7209040Z #-} 2026-02-21T09:52:25.7209280Z 
/tmp/torchinductor_root/su/csuirppd7sjaegtojwbrn3szezptbqy4dkrx3wujv6lrttfuq6gg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:52:25.7209705Z /tmp/torchinductor_root/su/csuirppd7sjaegtojwbrn3szezptbqy4dkrx3wujv6lrttfuq6gg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:52:25.7209819Z [476s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:52:25.7210455Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:52:25.7210532Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:52:25.7210616Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:52:28.7390219Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:52:28.7398056Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:52:28.7398400Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:52:28.7398805Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:52:28.7399103Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:52:28.7399406Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:52:28.7399670Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:52:28.7399904Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:52:28.7400089Z #smem = #ttg.shared_memory 2026-02-21T09:52:28.7400327Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:52:28.7400798Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:52:28.7401261Z %cst = arith.constant dense<4> : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7401405Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:52:28.7401522Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:52:28.7401634Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:52:28.7401864Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:52:28.7402007Z %cst_0 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7402190Z %cst_1 = arith.constant dense<0> : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7402339Z %c32768_i32 = arith.constant 32768 : i32 
2026-02-21T09:52:28.7402457Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:52:28.7402645Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:28.7402757Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:52:28.7402902Z %cst_2 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2> 2026-02-21T09:52:28.7403079Z %cst_3 = arith.constant dense<8192> : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7403257Z %cst_4 = arith.constant dense<0> : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7403427Z %cst_5 = arith.constant dense<512> : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7403594Z %cst_6 = arith.constant dense<0> : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7403767Z %cst_7 = arith.constant dense<8192> : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7403974Z %cst_8 = arith.constant dense<4> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7404227Z %cst_9 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7404420Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:52:28.7404538Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:52:28.7404748Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7404935Z %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:28.7405116Z %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:28.7405290Z %cst_13 = arith.constant dense<8192> : tensor<64x1xi32, #mma> 2026-02-21T09:52:28.7405439Z %0 = tt.get_program_id x : i32 2026-02-21T09:52:28.7405558Z %1 = arith.muli %0, %c2_i32 : i32 2026-02-21T09:52:28.7405671Z %2 = arith.addi %1, %c2_i32 : i32 2026-02-21T09:52:28.7405792Z %3 = arith.minsi %2, %c32768_i32 : i32 2026-02-21T09:52:28.7405993Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:28.7406297Z %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, 
#ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:28.7406562Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7406829Z %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:28.7407115Z %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7407357Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7407564Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7407800Z %11 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7408122Z %12 = arith.extsi %11 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7408484Z %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7408851Z %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:52:28.7409279Z %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:52:28.7409711Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:28.7409973Z %17 = arith.cmpi eq, %16, %cst_11 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:28.7410191Z %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x64xi1, #blocked1> 2026-02-21T09:52:28.7410408Z %19 = arith.cmpi eq, %16, %cst_12 : tensor<1x2x1xi32, #blocked1> 
2026-02-21T09:52:28.7410615Z %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x64xi1, #blocked1> 2026-02-21T09:52:28.7410840Z %21 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:52:28.7411010Z %22 = arith.subi %3, %1 : i32 2026-02-21T09:52:28.7411132Z %23 = arith.remsi %22, %c2_i32 : i32 2026-02-21T09:52:28.7411260Z %24 = arith.subi %22, %23 : i32 2026-02-21T09:52:28.7411382Z %25 = arith.addi %1, %24 : i32 2026-02-21T09:52:28.7411516Z scf.for %arg3 = %1 to %25 step %c2_i32 : i32 { 2026-02-21T09:52:28.7411669Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:52:28.7411795Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:52:28.7411922Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:52:28.7412045Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:52:28.7412170Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:52:28.7412298Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:52:28.7412412Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:52:28.7412530Z %33 = arith.divsi %30, %29 : i32 2026-02-21T09:52:28.7412665Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:52:28.7412828Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:28.7413040Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:28.7413205Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:52:28.7413380Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:28.7413595Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:28.7413811Z %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:28.7414043Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:28.7414316Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, 
#blocked2> 2026-02-21T09:52:28.7414603Z %43 = arith.muli %42, %cst_2 : tensor<64x1xi32, #blocked2> 2026-02-21T09:52:28.7414798Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7414995Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:52:28.7415164Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7415388Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7415663Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7415935Z %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7416137Z %50 = arith.cmpi sge, %48, %cst_4 : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7416307Z %51 = arith.cmpi slt, %48, %cst_3 : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7416473Z %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked> 2026-02-21T09:52:28.7416656Z %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7416868Z %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> 2026-02-21T09:52:28.7417147Z %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:52:28.7417419Z %56 = tt.broadcast %55 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7417633Z %57 = arith.addi %44, %56 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7417835Z %58 = tt.addptr %9, %57 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7418042Z %59 = tt.load %58 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7418289Z %60 = tt.expand_dims %12 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 
2026-02-21T09:52:28.7418535Z %61 = arith.muli %60, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7418725Z %62 = tt.broadcast %61 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7418915Z %63 = arith.addi %62, %49 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7419111Z %64 = tt.addptr %10, %63 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7419318Z %65 = arith.cmpi sge, %60, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7419489Z %66 = arith.cmpi slt, %60, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7419654Z %67 = arith.andi %65, %66 : tensor<4x1xi1, #blocked> 2026-02-21T09:52:28.7419834Z %68 = tt.broadcast %67 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7420023Z %69 = arith.andi %68, %53 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7420239Z %70 = tt.load %64, %69, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7420611Z %71 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7420995Z ttg.local_store %59, %71 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7421284Z %72 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7421587Z %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:52:28.7421870Z %74 = tt.broadcast %73 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7422078Z %75 = arith.addi %44, %74 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7422276Z %76 = tt.addptr %9, %75 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7422477Z %77 = tt.load %76 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7422671Z 
%78 = arith.addi %12, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7422962Z %79 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7423201Z %80 = arith.muli %79, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7423389Z %81 = tt.broadcast %80 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7423576Z %82 = arith.addi %81, %49 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7423766Z %83 = tt.addptr %10, %82 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7423975Z %84 = arith.cmpi sge, %79, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7424141Z %85 = arith.cmpi slt, %79, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7424305Z %86 = arith.andi %84, %85 : tensor<4x1xi1, #blocked> 2026-02-21T09:52:28.7424481Z %87 = tt.broadcast %86 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7424667Z %88 = arith.andi %87, %53 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7424873Z %89 = tt.load %83, %88, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7425206Z %90 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7425590Z ttg.local_store %77, %90 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7426205Z %91:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %71, %arg8 = %90, %arg9 = %70, %arg10 = %89) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>) : i32 { 2026-02-21T09:52:28.7426725Z %227 = arith.addi %arg4, 
%c8_i32 : i32 2026-02-21T09:52:28.7426858Z %228 = arith.muli %227, %c2_i32 : i32 2026-02-21T09:52:28.7427037Z %229 = tt.splat %228 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7427269Z %230 = arith.addi %229, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7427549Z %231 = tt.expand_dims %230 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:52:28.7427833Z %232 = tt.broadcast %231 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7428033Z %233 = arith.addi %44, %232 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7428234Z %234 = tt.addptr %9, %233 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7428461Z %235 = tt.load %234 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7428762Z %236 = ttg.local_load %arg7 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7429203Z %237 = arith.extf %236 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7429491Z %238 = arith.extsi %227 : i32 to i64 2026-02-21T09:52:28.7429670Z %239 = tt.splat %238 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7429919Z %240 = arith.addi %239, %12 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7430214Z %241 = tt.expand_dims %240 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7430470Z %242 = arith.muli %241, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7430668Z %243 = tt.broadcast %242 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7430862Z %244 = arith.addi %243, %49 : tensor<4x64xi64, 
#blocked> 2026-02-21T09:52:28.7431080Z %245 = tt.addptr %10, %244 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7431289Z %246 = arith.cmpi sge, %241, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7431472Z %247 = arith.cmpi slt, %241, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7431642Z %248 = arith.andi %246, %247 : tensor<4x1xi1, #blocked> 2026-02-21T09:52:28.7431830Z %249 = tt.broadcast %248 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7432024Z %250 = arith.andi %249, %53 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7432193Z %251 = tt.load %245, %250, %cst_1 : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7432376Z %252 = arith.shli %arg9, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7432537Z %253 = arith.shrsi %252, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7432787Z %254 = ttg.convert_layout %253 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7433040Z %255 = arith.shrsi %arg9, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7433282Z %256 = ttg.convert_layout %255 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7433639Z %257 = tt.expand_dims %254 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7433983Z %258 = tt.expand_dims %256 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7434274Z %259 = tt.broadcast %257 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7434529Z %260 = arith.select %18, %259, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7434776Z %261 = tt.broadcast %258 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7435021Z %262 = arith.select %20, 
%261, %260 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7435255Z %263 = tt.reshape %262 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7435486Z %264 = arith.sitofp %263 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7435745Z %265 = ttg.local_alloc %264 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7436073Z %266 = ttg.local_load %265 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7436573Z %267 = tt.dot %237, %266, %arg5, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7436929Z %268 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:28.7437060Z %269 = arith.cmpi slt, %268, %c2_i32 : i32 2026-02-21T09:52:28.7437203Z %270 = arith.select %269, %268, %c0_i32 : i32 2026-02-21T09:52:28.7437470Z %271 = ttg.memdesc_index %54[%270] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7437844Z ttg.local_store %235, %271 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7438336Z scf.yield %267, %270, %arg8, %271, %arg10, %251 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7438757Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:52:28.7439081Z %92 = ttg.local_load %91#2 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7439504Z %93 = 
arith.extf %92 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7439797Z %94 = arith.shli %91#4, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7439955Z %95 = arith.shrsi %94, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7440194Z %96 = ttg.convert_layout %95 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7440431Z %97 = arith.shrsi %91#4, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7440667Z %98 = ttg.convert_layout %97 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7440995Z %99 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7441333Z %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7441638Z %101 = tt.broadcast %99 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7441876Z %102 = arith.select %18, %101, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7442119Z %103 = tt.broadcast %100 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7442354Z %104 = arith.select %20, %103, %102 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7442620Z %105 = tt.reshape %104 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7442845Z %106 = arith.sitofp %105 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7443097Z %107 = ttg.local_alloc %106 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7443422Z %108 = ttg.local_load %107 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> 
tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7443887Z %109 = tt.dot %93, %108, %91#0, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7444371Z %110 = ttg.local_load %91#3 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7444796Z %111 = arith.extf %110 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7445109Z %112 = arith.shli %91#5, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7445267Z %113 = arith.shrsi %112, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7445509Z %114 = ttg.convert_layout %113 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7445751Z %115 = arith.shrsi %91#5, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7445991Z %116 = ttg.convert_layout %115 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7446342Z %117 = tt.expand_dims %114 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7446681Z %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7446970Z %119 = tt.broadcast %117 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7447242Z %120 = arith.select %18, %119, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7447482Z %121 = tt.broadcast %118 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7447719Z 
%122 = arith.select %20, %121, %120 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7447949Z %123 = tt.reshape %122 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7448172Z %124 = arith.sitofp %123 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7448419Z %125 = ttg.local_alloc %124 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7448739Z %126 = ttg.local_load %125 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7449205Z %127 = tt.dot %111, %126, %109, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7449583Z ttg.local_dealloc %54 : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> 2026-02-21T09:52:28.7449811Z %128 = arith.truncf %127 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:52:28.7450071Z %129 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:52:28.7450309Z %130 = arith.muli %129, %cst_13 : tensor<64x1xi32, #mma> 2026-02-21T09:52:28.7450540Z %131 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:52:28.7450792Z %132 = tt.broadcast %130 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7450992Z %133 = tt.broadcast %131 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7451168Z %134 = arith.addi %132, %133 : tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7451356Z %135 = tt.addptr %21, %134 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7451552Z tt.store %135, %128 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:52:28.7451693Z 
%136 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:52:28.7451819Z %137 = arith.divsi %136, %c512_i32 : i32 2026-02-21T09:52:28.7451938Z %138 = arith.muli %137, %c2_i32 : i32 2026-02-21T09:52:28.7452060Z %139 = arith.subi %c128_i32, %138 : i32 2026-02-21T09:52:28.7452177Z %140 = arith.minsi %139, %c2_i32 : i32 2026-02-21T09:52:28.7452297Z %141 = arith.remsi %136, %c512_i32 : i32 2026-02-21T09:52:28.7452414Z %142 = arith.remsi %141, %140 : i32 2026-02-21T09:52:28.7452543Z %143 = arith.addi %138, %142 : i32 2026-02-21T09:52:28.7452656Z %144 = arith.divsi %141, %140 : i32 2026-02-21T09:52:28.7452771Z %145 = arith.muli %143, %c64_i32 : i32 2026-02-21T09:52:28.7452933Z %146 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:28.7453142Z %147 = arith.addi %146, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:28.7453311Z %148 = arith.muli %144, %c64_i32 : i32 2026-02-21T09:52:28.7453477Z %149 = tt.splat %148 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:28.7453691Z %150 = tt.splat %148 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:28.7453928Z %151 = arith.addi %149, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:28.7454139Z %152 = arith.addi %150, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:28.7454413Z %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:52:28.7454664Z %154 = arith.muli %153, %cst_2 : tensor<64x1xi32, #blocked2> 2026-02-21T09:52:28.7454882Z %155 = tt.broadcast %154 : tensor<64x1xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7455060Z %156 = arith.extsi %145 : i32 to i64 2026-02-21T09:52:28.7455225Z %157 = tt.splat %156 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7455444Z %158 = arith.addi %157, %13 
: tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7455718Z %159 = tt.expand_dims %158 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7455994Z %160 = tt.broadcast %159 : tensor<1x64xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7456194Z %161 = arith.cmpi sge, %159, %cst_4 : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7456366Z %162 = arith.cmpi slt, %159, %cst_3 : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7456530Z %163 = arith.andi %161, %162 : tensor<1x64xi1, #blocked> 2026-02-21T09:52:28.7456712Z %164 = tt.broadcast %163 : tensor<1x64xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7456926Z %165 = ttg.local_alloc : () -> !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> 2026-02-21T09:52:28.7457112Z %166 = arith.addi %155, %56 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7457333Z %167 = tt.addptr %9, %166 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7457541Z %168 = tt.load %167 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7457697Z %169 = arith.addi %62, %160 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7457890Z %170 = tt.addptr %10, %169 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7458085Z %171 = arith.andi %68, %164 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7458299Z %172 = tt.load %170, %171, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7458639Z %173 = ttg.memdesc_index %165[%c0_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7458997Z ttg.local_store %168, %173 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7459237Z %174 = arith.addi %155, %74 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7459433Z %175 = tt.addptr 
%9, %174 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7459641Z %176 = tt.load %175 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7459798Z %177 = arith.addi %81, %160 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7459989Z %178 = tt.addptr %10, %177 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7468589Z %179 = arith.andi %87, %164 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7468800Z %180 = tt.load %178, %179, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7469135Z %181 = ttg.memdesc_index %165[%c1_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7469491Z ttg.local_store %176, %181 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7470108Z %182:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %173, %arg8 = %181, %arg9 = %172, %arg10 = %180) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>) : i32 { 2026-02-21T09:52:28.7470660Z %227 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:52:28.7470790Z %228 = arith.muli %227, %c2_i32 : i32 2026-02-21T09:52:28.7470979Z %229 = tt.splat %228 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7471206Z %230 = arith.addi %229, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7471483Z %231 = tt.expand_dims %230 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:52:28.7471762Z %232 = tt.broadcast %231 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7471962Z %233 = arith.addi %155, 
%232 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7472162Z %234 = tt.addptr %9, %233 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7472368Z %235 = tt.load %234 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7472669Z %236 = ttg.local_load %arg7 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7473112Z %237 = arith.extf %236 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7473397Z %238 = arith.extsi %227 : i32 to i64 2026-02-21T09:52:28.7473583Z %239 = tt.splat %238 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7473804Z %240 = arith.addi %239, %12 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7474081Z %241 = tt.expand_dims %240 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7474325Z %242 = arith.muli %241, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7474517Z %243 = tt.broadcast %242 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7474707Z %244 = arith.addi %243, %160 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7474907Z %245 = tt.addptr %10, %244 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7475113Z %246 = arith.cmpi sge, %241, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7475285Z %247 = arith.cmpi slt, %241, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7475449Z %248 = arith.andi %246, %247 : tensor<4x1xi1, #blocked> 2026-02-21T09:52:28.7475632Z %249 = tt.broadcast %248 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7475822Z %250 = arith.andi %249, %164 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7475986Z %251 = tt.load 
%245, %250, %cst_1 : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7476161Z %252 = arith.shli %arg9, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7476332Z %253 = arith.shrsi %252, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7476573Z %254 = ttg.convert_layout %253 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7476819Z %255 = arith.shrsi %arg9, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7477059Z %256 = ttg.convert_layout %255 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7477395Z %257 = tt.expand_dims %254 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7477748Z %258 = tt.expand_dims %256 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7478035Z %259 = tt.broadcast %257 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7478282Z %260 = arith.select %18, %259, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7478524Z %261 = tt.broadcast %258 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7478781Z %262 = arith.select %20, %261, %260 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7479012Z %263 = tt.reshape %262 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7479237Z %264 = arith.sitofp %263 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7479489Z %265 = ttg.local_alloc %264 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7479813Z %266 = ttg.local_load %265 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7480288Z %267 = 
tt.dot %237, %266, %arg5, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7480638Z %268 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:28.7480767Z %269 = arith.cmpi slt, %268, %c2_i32 : i32 2026-02-21T09:52:28.7480901Z %270 = arith.select %269, %268, %c0_i32 : i32 2026-02-21T09:52:28.7481164Z %271 = ttg.memdesc_index %165[%270] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7481536Z ttg.local_store %235, %271 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7482014Z scf.yield %267, %270, %arg8, %271, %arg10, %251 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7482428Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:52:28.7482779Z %183 = ttg.local_load %182#2 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7483206Z %184 = arith.extf %183 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7483505Z %185 = arith.shli %182#4, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7483667Z %186 = arith.shrsi %185, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7483908Z %187 = ttg.convert_layout %186 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7484156Z %188 = arith.shrsi %182#4, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7484394Z %189 = ttg.convert_layout 
%188 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7484745Z %190 = tt.expand_dims %187 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7485085Z %191 = tt.expand_dims %189 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7485367Z %192 = tt.broadcast %190 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7485610Z %193 = arith.select %18, %192, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7485870Z %194 = tt.broadcast %191 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7486101Z %195 = arith.select %20, %194, %193 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7486332Z %196 = tt.reshape %195 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7486552Z %197 = arith.sitofp %196 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7486822Z %198 = ttg.local_alloc %197 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7487148Z %199 = ttg.local_load %198 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7487614Z %200 = tt.dot %184, %199, %182#0, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7488111Z %201 = ttg.local_load %182#3 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7488539Z %202 = arith.extf %201 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7488833Z %203 = arith.shli %182#5, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7488998Z %204 = arith.shrsi %203, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7489237Z %205 = ttg.convert_layout %204 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7489481Z %206 = arith.shrsi %182#5, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7489739Z %207 = ttg.convert_layout %206 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7490071Z %208 = tt.expand_dims %205 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7490409Z %209 = tt.expand_dims %207 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7490693Z %210 = tt.broadcast %208 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7490935Z %211 = arith.select %18, %210, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7491177Z %212 = tt.broadcast %209 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7491410Z %213 = arith.select %20, %212, %211 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7491643Z %214 = tt.reshape %213 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7491861Z %215 = arith.sitofp %214 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7492113Z %216 = ttg.local_alloc %215 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7492435Z %217 = ttg.local_load %216 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth 
= 2}>> 2026-02-21T09:52:28.7492914Z %218 = tt.dot %202, %217, %200, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7493295Z ttg.local_dealloc %165 : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> 2026-02-21T09:52:28.7493505Z %219 = arith.truncf %218 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:52:28.7493772Z %220 = tt.expand_dims %152 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:52:28.7494039Z %221 = arith.muli %220, %cst_13 : tensor<64x1xi32, #mma> 2026-02-21T09:52:28.7494289Z %222 = tt.expand_dims %147 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:52:28.7494544Z %223 = tt.broadcast %221 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7494746Z %224 = tt.broadcast %222 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7494926Z %225 = arith.addi %223, %224 : tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7495128Z %226 = tt.addptr %21, %225 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7495324Z tt.store %226, %219 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:52:28.7495461Z } {tt.num_stages = 1 : i32} 2026-02-21T09:52:28.7495585Z scf.for %arg3 = %25 to %3 step %c1_i32 : i32 { 2026-02-21T09:52:28.7495718Z %26 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:52:28.7495846Z %27 = arith.muli %26, %c2_i32 : i32 2026-02-21T09:52:28.7495972Z %28 = arith.subi %c128_i32, %27 : i32 2026-02-21T09:52:28.7496089Z %29 = arith.minsi %28, %c2_i32 : i32 2026-02-21T09:52:28.7496210Z %30 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:52:28.7496326Z %31 = arith.remsi %30, %29 : i32 2026-02-21T09:52:28.7496440Z %32 = arith.addi %27, %31 : i32 2026-02-21T09:52:28.7496548Z %33 = arith.divsi %30, 
%29 : i32 2026-02-21T09:52:28.7496661Z %34 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:52:28.7496817Z %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:28.7497023Z %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:28.7497188Z %37 = arith.muli %33, %c64_i32 : i32 2026-02-21T09:52:28.7497369Z %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:28.7497578Z %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:28.7497785Z %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:28.7497993Z %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:28.7498258Z %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2> 2026-02-21T09:52:28.7498502Z %43 = arith.muli %42, %cst_2 : tensor<64x1xi32, #blocked2> 2026-02-21T09:52:28.7498692Z %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7498863Z %45 = arith.extsi %34 : i32 to i64 2026-02-21T09:52:28.7499026Z %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7499242Z %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:28.7499505Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7499774Z %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7499970Z %50 = arith.cmpi sge, %48, %cst_4 : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7500154Z %51 = arith.cmpi slt, %48, %cst_3 : tensor<1x64xi64, #blocked> 2026-02-21T09:52:28.7500315Z %52 = arith.andi %50, %51 : tensor<1x64xi1, 
#blocked> 2026-02-21T09:52:28.7500494Z %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7500710Z %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> 2026-02-21T09:52:28.7500978Z %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:52:28.7501247Z %56 = tt.broadcast %55 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7501452Z %57 = arith.addi %44, %56 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7501646Z %58 = tt.addptr %9, %57 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7509723Z %59 = tt.load %58 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7509992Z %60 = tt.expand_dims %12 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7510234Z %61 = arith.muli %60, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7510463Z %62 = tt.broadcast %61 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7510651Z %63 = arith.addi %62, %49 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7510841Z %64 = tt.addptr %10, %63 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7511044Z %65 = arith.cmpi sge, %60, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7511211Z %66 = arith.cmpi slt, %60, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7511369Z %67 = arith.andi %65, %66 : tensor<4x1xi1, #blocked> 2026-02-21T09:52:28.7511541Z %68 = tt.broadcast %67 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7511724Z %69 = arith.andi %68, %53 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7511932Z %70 = tt.load %64, %69, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7512266Z %71 = ttg.memdesc_index %54[%c0_i32] : 
!ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7512620Z ttg.local_store %59, %71 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7512911Z %72 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7513188Z %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:52:28.7513460Z %74 = tt.broadcast %73 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7513647Z %75 = arith.addi %44, %74 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7513842Z %76 = tt.addptr %9, %75 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7514038Z %77 = tt.load %76 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7514227Z %78 = arith.addi %12, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7514495Z %79 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7514730Z %80 = arith.muli %79, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7514911Z %81 = tt.broadcast %80 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7515094Z %82 = arith.addi %81, %49 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7515280Z %83 = tt.addptr %10, %82 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7515481Z %84 = arith.cmpi sge, %79, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7515658Z %85 = arith.cmpi slt, %79, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7515814Z %86 = arith.andi %84, %85 : tensor<4x1xi1, #blocked> 2026-02-21T09:52:28.7515987Z %87 = tt.broadcast %86 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7516168Z %88 
= arith.andi %87, %53 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7516369Z %89 = tt.load %83, %88, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7516704Z %90 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7517070Z ttg.local_store %77, %90 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7517680Z %91:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %71, %arg8 = %90, %arg9 = %70, %arg10 = %89) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>) : i32 { 2026-02-21T09:52:28.7518209Z %136 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:52:28.7518339Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:52:28.7518513Z %138 = tt.splat %137 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7518739Z %139 = arith.addi %138, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:28.7519015Z %140 = tt.expand_dims %139 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:52:28.7519294Z %141 = tt.broadcast %140 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7519489Z %142 = arith.addi %44, %141 : tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7519690Z %143 = tt.addptr %9, %142 : tensor<64x8x!tt.ptr, #blocked2>, tensor<64x8xi32, #blocked2> 2026-02-21T09:52:28.7519895Z %144 = tt.load %143 : tensor<64x8x!tt.ptr, #blocked2> 2026-02-21T09:52:28.7520196Z %145 = ttg.local_load %arg7 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:52:28.7520650Z %146 = arith.extf %145 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7520932Z %147 = arith.extsi %136 : i32 to i64 2026-02-21T09:52:28.7521103Z %148 = tt.splat %147 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7521325Z %149 = arith.addi %148, %12 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:28.7521596Z %150 = tt.expand_dims %149 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7521845Z %151 = arith.muli %150, %cst_7 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7522037Z %152 = tt.broadcast %151 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7522225Z %153 = arith.addi %152, %49 : tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7522422Z %154 = tt.addptr %10, %153 : tensor<4x64x!tt.ptr, #blocked>, tensor<4x64xi64, #blocked> 2026-02-21T09:52:28.7522682Z %155 = arith.cmpi sge, %150, %cst_6 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7522855Z %156 = arith.cmpi slt, %150, %cst_5 : tensor<4x1xi64, #blocked> 2026-02-21T09:52:28.7523023Z %157 = arith.andi %155, %156 : tensor<4x1xi1, #blocked> 2026-02-21T09:52:28.7523207Z %158 = tt.broadcast %157 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7523396Z %159 = arith.andi %158, %53 : tensor<4x64xi1, #blocked> 2026-02-21T09:52:28.7523580Z %160 = tt.load %154, %159, %cst_1 : tensor<4x64x!tt.ptr, #blocked> 2026-02-21T09:52:28.7523754Z %161 = arith.shli %arg9, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7523913Z %162 = arith.shrsi %161, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7524156Z %163 = ttg.convert_layout %162 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7524405Z %164 = 
arith.shrsi %arg9, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7524644Z %165 = ttg.convert_layout %164 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7525002Z %166 = tt.expand_dims %163 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7525341Z %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7525629Z %168 = tt.broadcast %166 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7525898Z %169 = arith.select %18, %168, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7526138Z %170 = tt.broadcast %167 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7526378Z %171 = arith.select %20, %170, %169 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7526613Z %172 = tt.reshape %171 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7526838Z %173 = arith.sitofp %172 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7527089Z %174 = ttg.local_alloc %173 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7527413Z %175 = ttg.local_load %174 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7527888Z %176 = tt.dot %146, %175, %arg5, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7528238Z %177 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:52:28.7528365Z %178 = arith.cmpi slt, %177, %c2_i32 : i32 2026-02-21T09:52:28.7528520Z %179 = arith.select 
%178, %177, %c0_i32 : i32 2026-02-21T09:52:28.7528780Z %180 = ttg.memdesc_index %54[%179] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7529133Z ttg.local_store %144, %180 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> 2026-02-21T09:52:28.7529606Z scf.yield %176, %179, %arg8, %180, %arg10, %160 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7530026Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:52:28.7530336Z %92 = ttg.local_load %91#2 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7530758Z %93 = arith.extf %92 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7531050Z %94 = arith.shli %91#4, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7531208Z %95 = arith.shrsi %94, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7531442Z %96 = ttg.convert_layout %95 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7531698Z %97 = arith.shrsi %91#4, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7531931Z %98 = ttg.convert_layout %97 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7532258Z %99 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7532596Z %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 
2026-02-21T09:52:28.7532877Z %101 = tt.broadcast %99 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7533135Z %102 = arith.select %18, %101, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7533376Z %103 = tt.broadcast %100 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7533610Z %104 = arith.select %20, %103, %102 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7533843Z %105 = tt.reshape %104 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7534080Z %106 = arith.sitofp %105 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7534329Z %107 = ttg.local_alloc %106 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7534653Z %108 = ttg.local_load %107 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7535118Z %109 = tt.dot %93, %108, %91#0, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7535605Z %110 = ttg.local_load %91#3 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7536032Z %111 = arith.extf %110 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7536327Z %112 = arith.shli %91#5, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7536487Z %113 = arith.shrsi %112, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7536742Z %114 = ttg.convert_layout %113 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:52:28.7536987Z %115 = arith.shrsi %91#5, %cst : tensor<4x64xi8, #blocked> 2026-02-21T09:52:28.7537229Z %116 = ttg.convert_layout %115 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:28.7537560Z %117 = tt.expand_dims %114 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7537898Z %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1> 2026-02-21T09:52:28.7538180Z %119 = tt.broadcast %117 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7538423Z %120 = arith.select %18, %119, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7538666Z %121 = tt.broadcast %118 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7538899Z %122 = arith.select %20, %121, %120 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1> 2026-02-21T09:52:28.7539132Z %123 = tt.reshape %122 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3> 2026-02-21T09:52:28.7539349Z %124 = arith.sitofp %123 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3> 2026-02-21T09:52:28.7539599Z %125 = ttg.local_alloc %124 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:52:28.7539942Z %126 = ttg.local_load %125 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:28.7540403Z %127 = tt.dot %111, %126, %109, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma> 2026-02-21T09:52:28.7540787Z ttg.local_dealloc %54 : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> 2026-02-21T09:52:28.7540998Z %128 = arith.truncf %127 
: tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma> 2026-02-21T09:52:28.7541276Z %129 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma> 2026-02-21T09:52:28.7541511Z %130 = arith.muli %129, %cst_13 : tensor<64x1xi32, #mma> 2026-02-21T09:52:28.7541740Z %131 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:52:28.7541994Z %132 = tt.broadcast %130 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7542207Z %133 = tt.broadcast %131 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7542385Z %134 = arith.addi %132, %133 : tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7542574Z %135 = tt.addptr %21, %134 : tensor<64x64x!tt.ptr, #mma>, tensor<64x64xi32, #mma> 2026-02-21T09:52:28.7542768Z tt.store %135, %128 : tensor<64x64x!tt.ptr, #mma> 2026-02-21T09:52:28.7542908Z } {tt.num_stages = 1 : i32} 2026-02-21T09:52:28.7543013Z tt.return 2026-02-21T09:52:28.7543098Z } 2026-02-21T09:52:28.7543172Z } 2026-02-21T09:52:28.7543220Z 2026-02-21T09:52:28.7543251Z {-# 2026-02-21T09:52:28.7543333Z external_resources: { 2026-02-21T09:52:28.7543430Z mlir_reproducer: { 2026-02-21T09:52:28.7544460Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:52:28.7545449Z disable_threading: false, 2026-02-21T09:52:28.7545554Z verify_each: true 
2026-02-21T09:52:28.7545646Z } 2026-02-21T09:52:28.7545717Z } 2026-02-21T09:52:28.7545787Z #-} 2026-02-21T09:52:28.7546063Z /tmp/torchinductor_root/uu/cuut2pp4drnpd2em2agvv3vob4ubf3hxd3hqoqovoqkudileffxn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:52:28.7546746Z /tmp/torchinductor_root/uu/cuut2pp4drnpd2em2agvv3vob4ubf3hxd3hqoqovoqkudileffxn.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:52:28.7547298Z [479s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:52:28.7548078Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:52:28.7548777Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:52:28.7548964Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:52:30.4939615Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:52:30.4942679Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:52:30.4943044Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:52:30.4943354Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:52:30.4944031Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:52:30.4944306Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:52:30.4944545Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:52:30.4944722Z #smem = #ttg.shared_memory 2026-02-21T09:52:30.4945023Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:52:30.4945490Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:52:30.4946017Z %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:52:30.4946189Z %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:52:30.4946363Z %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:52:30.4946530Z %cst_2 = arith.constant dense<8192> : tensor<1x128xi64, #mma> 2026-02-21T09:52:30.4946698Z %cst_3 = arith.constant dense<0> : tensor<1x128xi64, #mma> 2026-02-21T09:52:30.4946870Z %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.4947037Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.4947208Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 
2026-02-21T09:52:30.4947379Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:30.4947549Z %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:30.4947828Z %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:30.4948012Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:52:30.4948178Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:52:30.4948294Z %c510_i32 = arith.constant 510 : i32 2026-02-21T09:52:30.4948474Z %cst_11 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:30.4948663Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:52:30.4948789Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:30.4948930Z %cst_12 = arith.constant dense<0> : tensor<2x128xi8, #blocked> 2026-02-21T09:52:30.4949109Z %cst_13 = arith.constant dense<8192> : tensor<1x128xi64, #blocked> 2026-02-21T09:52:30.4949281Z %cst_14 = arith.constant dense<0> : tensor<1x128xi64, #blocked> 2026-02-21T09:52:30.4949452Z %cst_15 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.4949599Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:52:30.4949780Z %cst_16 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.4949974Z %0 = tt.get_program_id x : i32 2026-02-21T09:52:30.4950085Z %1 = arith.divsi %0, %c128_i32 : i32 2026-02-21T09:52:30.4950202Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:52:30.4950313Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:52:30.4950425Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:52:30.4950591Z %5 = arith.remsi %0, %c128_i32 : i32 2026-02-21T09:52:30.4950702Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:52:30.4950811Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:52:30.4950914Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:52:30.4951022Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:52:30.4951222Z %10 = tt.make_range 
{end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:30.4951500Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:30.4951764Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:30.4952047Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:30.4952304Z %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:30.4952528Z %15 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:52:30.4952696Z %16 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:52:30.4984010Z %17 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:30.4984347Z %18 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:52:30.4984600Z %19 = arith.muli %18, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:52:30.4984793Z %20 = tt.broadcast %19 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:30.4985019Z %21 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:30.4985185Z %22 = arith.extsi %16 : i32 to i64 2026-02-21T09:52:30.4985335Z %23 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:52:30.4985567Z %24 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:30.4985879Z %25 = arith.extsi %24 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:30.4986166Z %26 = tt.splat %22 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = 
#blocked}>> 2026-02-21T09:52:30.4986599Z %27 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:30.4986884Z %28 = arith.addi %26, %27 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:52:30.4987152Z %29 = tt.expand_dims %28 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked> 2026-02-21T09:52:30.4987423Z %30 = tt.broadcast %29 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:52:30.4987622Z %31 = arith.cmpi sge, %29, %cst_14 : tensor<1x128xi64, #blocked> 2026-02-21T09:52:30.4987790Z %32 = arith.cmpi slt, %29, %cst_13 : tensor<1x128xi64, #blocked> 2026-02-21T09:52:30.4987952Z %33 = arith.andi %31, %32 : tensor<1x128xi1, #blocked> 2026-02-21T09:52:30.4988127Z %34 = tt.broadcast %33 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:52:30.4988413Z %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:52:30.4988826Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:52:30.4989223Z %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:30.4989477Z %38 = arith.cmpi eq, %37, %cst_8 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:30.4989708Z %39 = tt.broadcast %38 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:52:30.4989904Z %40 = arith.cmpi eq, %37, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:52:30.4990097Z %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:52:30.4990310Z %42 = ttg.local_alloc : () -> 
!ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:30.4990579Z %43 = tt.expand_dims %17 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:30.4990843Z %44 = tt.broadcast %43 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:30.4991057Z %45 = arith.addi %20, %44 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:30.4991253Z %46 = tt.addptr %21, %45 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:30.4991452Z %47 = tt.load %46 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:30.4991736Z %48 = ttg.memdesc_index %42[%c0_i32] : !ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> 2026-02-21T09:52:30.4992110Z ttg.local_store %47, %48 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> 2026-02-21T09:52:30.4992541Z %49:3 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c2_i32 iter_args(%arg4 = %cst_10, %arg5 = %c0_i32, %arg6 = %48) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4>) : i32 { 2026-02-21T09:52:30.4992873Z %104 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:52:30.4992997Z %105 = arith.muli %104, %c2_i32 : i32 2026-02-21T09:52:30.4993168Z %106 = tt.splat %105 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:30.4993391Z %107 = arith.addi %106, %17 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:52:30.4993668Z %108 = tt.expand_dims %107 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:52:30.4993946Z %109 = tt.broadcast %108 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:52:30.4994140Z %110 = arith.addi %20, %109 : tensor<128x4xi32, #blocked2> 2026-02-21T09:52:30.4994343Z %111 = tt.addptr %21, %110 : 
tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:52:30.4994568Z %112 = tt.load %111 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:52:30.4994872Z %113 = ttg.local_load %arg6 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:30.4995318Z %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:30.4995594Z %115 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:52:30.4995763Z %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:30.4995982Z %117 = arith.addi %116, %25 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:30.4996253Z %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.4996497Z %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.4996684Z %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:52:30.4996875Z %121 = arith.addi %120, %30 : tensor<2x128xi64, #blocked> 2026-02-21T09:52:30.4997069Z %122 = tt.addptr %23, %121 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:52:30.4997277Z %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.4997465Z %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.4997626Z %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:30.4997813Z %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:52:30.4997999Z %127 = arith.andi %126, %34 : tensor<2x128xi1, #blocked> 2026-02-21T09:52:30.4998169Z %128 = tt.load %122, %127, %cst_12 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:52:30.4998431Z 
%129 = ttg.convert_layout %128 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.4998718Z %130 = arith.shli %129, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.4998979Z %131 = arith.shrsi %130, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.4999220Z %132 = arith.shrsi %129, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.4999518Z %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:52:30.4999877Z %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:52:30.5000163Z %135 = tt.broadcast %133 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5000417Z %136 = arith.select %39, %135, %cst_15 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5000660Z %137 = tt.broadcast %134 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5000905Z %138 = arith.select %41, %137, %136 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5001137Z %139 = tt.reshape %138 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked> 2026-02-21T09:52:30.5001356Z %140 = arith.sitofp %139 : tensor<4x128xi8, #blocked> to tensor<4x128xf32, #blocked> 2026-02-21T09:52:30.5001604Z %141 = ttg.local_alloc %140 : (tensor<4x128xf32, #blocked>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:52:30.5001928Z %142 = ttg.local_load %141 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:30.5002423Z %143 = tt.dot %114, %142, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:52:30.5002798Z %144 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:52:30.5002923Z %145 = arith.cmpi slt, %144, %c1_i32 : i32 2026-02-21T09:52:30.5003053Z %146 = arith.select %145, %144, %c0_i32 : i32 2026-02-21T09:52:30.5003315Z %147 = ttg.memdesc_index %42[%146] : !ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> 2026-02-21T09:52:30.5003678Z ttg.local_store %112, %147 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> 2026-02-21T09:52:30.5003992Z scf.yield %143, %146, %147 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> 2026-02-21T09:52:30.5004270Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:52:30.5004610Z %50 = ttg.local_load %49#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:30.5005034Z %51 = arith.extf %50 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:30.5005359Z %52 = arith.addi %25, %cst_11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:52:30.5005649Z %53 = tt.expand_dims %52 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.5005883Z %54 = arith.muli %53, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.5006066Z %55 = tt.broadcast %54 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:52:30.5006248Z %56 = arith.addi %55, %30 : tensor<2x128xi64, #blocked> 2026-02-21T09:52:30.5006436Z %57 = tt.addptr %23, %56 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 
2026-02-21T09:52:30.5006639Z %58 = arith.cmpi sge, %53, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.5006822Z %59 = arith.cmpi slt, %53, %cst_4 : tensor<2x1xi64, #blocked> 2026-02-21T09:52:30.5006978Z %60 = arith.andi %58, %59 : tensor<2x1xi1, #blocked> 2026-02-21T09:52:30.5007152Z %61 = tt.broadcast %60 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:52:30.5007332Z %62 = arith.andi %61, %34 : tensor<2x128xi1, #blocked> 2026-02-21T09:52:30.5007491Z %63 = tt.load %57, %62, %cst_12 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:52:30.5007756Z %64 = ttg.convert_layout %63 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.5008034Z %65 = arith.shli %64, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.5008262Z %66 = arith.shrsi %65, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.5008500Z %67 = arith.shrsi %64, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:52:30.5008785Z %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:52:30.5009120Z %69 = tt.expand_dims %67 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:52:30.5009402Z %70 = tt.broadcast %68 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5009638Z %71 = arith.select %39, %70, %cst_15 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5009875Z %72 = tt.broadcast %69 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5010106Z %73 = arith.select %41, %72, %71 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:52:30.5010346Z %74 = tt.reshape %73 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked> 
2026-02-21T09:52:30.5010562Z %75 = arith.sitofp %74 : tensor<4x128xi8, #blocked> to tensor<4x128xf32, #blocked> 2026-02-21T09:52:30.5010802Z %76 = ttg.local_alloc %75 : (tensor<4x128xf32, #blocked>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:52:30.5011117Z %77 = ttg.local_load %76 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:52:30.5011583Z %78 = tt.dot %51, %77, %49#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:52:30.5011959Z ttg.local_dealloc %42 : !ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:52:30.5012168Z %79 = arith.truncf %78 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:52:30.5012334Z %80 = arith.extsi %9 : i32 to i64 2026-02-21T09:52:30.5012492Z %81 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:52:30.5012693Z %82 = tt.splat %80 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:30.5012964Z %83 = arith.extsi %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:30.5013296Z %84 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:30.5013578Z %85 = arith.addi %82, %83 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:52:30.5013841Z %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:52:30.5014076Z %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:52:30.5014247Z %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:52:30.5014449Z %89 = tt.splat %22 : i64 -> 
tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:30.5014666Z %90 = arith.addi %89, %84 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:52:30.5014922Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma> 2026-02-21T09:52:30.5015175Z %92 = tt.broadcast %91 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma> 2026-02-21T09:52:30.5015349Z %93 = arith.addi %88, %92 : tensor<128x128xi64, #mma> 2026-02-21T09:52:30.5015536Z %94 = tt.addptr %81, %93 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi64, #mma> 2026-02-21T09:52:30.5015750Z %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma> 2026-02-21T09:52:30.5015909Z %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma> 2026-02-21T09:52:30.5016055Z %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma> 2026-02-21T09:52:30.5016222Z %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:52:30.5016402Z %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x128xi64, #mma> 2026-02-21T09:52:30.5016564Z %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x128xi64, #mma> 2026-02-21T09:52:30.5016721Z %101 = arith.andi %99, %100 : tensor<1x128xi1, #mma> 2026-02-21T09:52:30.5016890Z %102 = tt.broadcast %101 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma> 2026-02-21T09:52:30.5017068Z %103 = arith.andi %98, %102 : tensor<128x128xi1, #mma> 2026-02-21T09:52:30.5017226Z tt.store %94, %79, %103 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:52:30.5017361Z tt.return 2026-02-21T09:52:30.5017442Z } 2026-02-21T09:52:30.5017517Z } 2026-02-21T09:52:30.5017559Z 2026-02-21T09:52:30.5017595Z {-# 2026-02-21T09:52:30.5017673Z external_resources: { 2026-02-21T09:52:30.5017773Z mlir_reproducer: { 2026-02-21T09:52:30.5018775Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, 
convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:52:30.5019775Z disable_threading: false, 2026-02-21T09:52:30.5019880Z verify_each: true 2026-02-21T09:52:30.5019970Z } 2026-02-21T09:52:30.5020039Z } 2026-02-21T09:52:30.5020109Z #-} 2026-02-21T09:52:30.5020393Z /tmp/torchinductor_root/lo/cloq6ns7tap3eo3pmp6ejokhmdqe7g4cg5xwcrxxbveenhy2w4e2.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:52:30.5021090Z /tmp/torchinductor_root/lo/cloq6ns7tap3eo3pmp6ejokhmdqe7g4cg5xwcrxxbveenhy2w4e2.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:52:30.5021640Z [481s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:52:30.5022372Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:52:30.5023040Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:52:30.5023208Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:52:31.4821137Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 82/82 7.8 configs/s 2026-02-21T09:52:38.4043166Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 194/194 19.5 configs/s 2026-02-21T09:52:41.4864505Z [492s] Generation 6 complete: 2026-02-21T09:52:41.4864918Z error=8 2026-02-21T09:52:41.4865116Z ok=77 2026-02-21T09:52:41.4865332Z min=1.0169 2026-02-21T09:52:41.4865537Z mid=1.6862 2026-02-21T09:52:41.4865732Z max=62.3544 2026-02-21T09:52:41.4866014Z best={'block_sizes': [8, 128, 128], 2026-02-21T09:52:41.4866385Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T09:52:41.4866743Z 'l2_groupings': [2], 2026-02-21T09:52:41.4867009Z 'load_eviction_policies': ['', ''], 2026-02-21T09:52:41.4867777Z 'loop_orders': [[0, 1]], 2026-02-21T09:52:41.4868054Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:52:41.4868322Z 'num_stages': 1, 2026-02-21T09:52:41.4868547Z 'num_warps': 4, 2026-02-21T09:52:41.4868675Z 'pid_type': 'flat', 2026-02-21T09:52:41.4868774Z 'range_flattens': [None, None], 2026-02-21T09:52:41.4868896Z 'range_multi_buffers': [None, False], 2026-02-21T09:52:41.4869011Z 'range_num_stages': [0, 2], 2026-02-21T09:52:41.4869127Z 'range_unroll_factors': [0, 0], 2026-02-21T09:52:41.4869236Z 'range_warp_specializes': [], 2026-02-21T09:52:41.4869335Z 'waves_per_eu': 2} 2026-02-21T09:52:41.4979498Z [492s] Fitting surrogate: 675 points, 675 targets 2026-02-21T09:52:42.4037147Z [493s] Generation 7 starting: 81 neighbors, 4 active search path(s) 2026-02-21T09:53:05.9831108Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.6 configs/s 2026-02-21T09:53:09.3261353Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:53:09.3271463Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T09:53:09.3274363Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:53:09.3274716Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:53:09.3275136Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T09:53:09.3275607Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:53:09.3276009Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:53:09.3276354Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:53:09.3276627Z #smem = #ttg.shared_memory 2026-02-21T09:53:09.3276977Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:53:09.3277699Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:53:09.3278199Z %cst = arith.constant dense<8192> : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3278376Z %cst_0 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3278546Z %cst_1 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3278796Z %cst_2 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3278961Z %cst_3 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3279136Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:09.3279305Z %cst_5 = 
arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:09.3279490Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3279663Z %c13823_i32 = arith.constant 13823 : i32 2026-02-21T09:53:09.3279787Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:53:09.3279909Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:53:09.3280172Z %cst_7 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3280428Z %cst_8 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3280681Z %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3280900Z %cst_10 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3281143Z %cst_11 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3281294Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:53:09.3281418Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:53:09.3281536Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:53:09.3281656Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:53:09.3281774Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:53:09.3281897Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:53:09.3282022Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T09:53:09.3282143Z %c19456_i32 = arith.constant 19456 : i32 2026-02-21T09:53:09.3282268Z %c29184_i32 = arith.constant 29184 : i32 2026-02-21T09:53:09.3282415Z %cst_12 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3282659Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:53:09.3282777Z %c9728_i32 = arith.constant 9728 : i32 2026-02-21T09:53:09.3282966Z %cst_13 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3283162Z %0 = tt.get_program_id x : i32 2026-02-21T09:53:09.3283360Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} 
: tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3283677Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3283954Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3284256Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3284526Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3284792Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3285037Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3285240Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3285510Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:53:09.3285931Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:53:09.3286337Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:09.3286589Z %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:09.3286809Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:53:09.3287005Z %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:09.3287200Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:53:09.3287408Z %16 = tt.splat %arg2 : !tt.ptr -> 
tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:09.3287683Z %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3288022Z %18 = arith.extsi %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3288275Z %19 = arith.subi %c13823_i32, %0 : i32 2026-02-21T09:53:09.3288400Z %20 = arith.divui %19, %c9728_i32 : i32 2026-02-21T09:53:09.3288519Z %21 = arith.remsi %20, %c4_i32 : i32 2026-02-21T09:53:09.3288641Z %22 = arith.subi %20, %21 : i32 2026-02-21T09:53:09.3288761Z %23 = arith.muli %22, %c9728_i32 : i32 2026-02-21T09:53:09.3288876Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:53:09.3289029Z scf.for %arg3 = %0 to %24 step %c38912_i32 : i32 { 2026-02-21T09:53:09.3289170Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:53:09.3289297Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:53:09.3289416Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:53:09.3289538Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:53:09.3289659Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:53:09.3289784Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:53:09.3289904Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:53:09.3290017Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:53:09.3290134Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:53:09.3290304Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3290535Z %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3290804Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:53:09.3291068Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3291422Z %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3291787Z %39 = tt.expand_dims 
%35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3292045Z %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3292242Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3292524Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:53:09.3292807Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3293028Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3293301Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3293569Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3293768Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3293970Z %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3294174Z %49 = tt.load %48 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3294462Z %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3294847Z ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3295126Z %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3295404Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3295676Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 
2026-02-21T09:53:09.3295880Z %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3296093Z %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3296332Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3296618Z %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3296976Z ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3297525Z %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:53:09.3298006Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3298239Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3298426Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:09.3298551Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:53:09.3298727Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3298955Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3299232Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3299515Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3299710Z %417 = arith.addi %41, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3300104Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3300314Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3300626Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3301075Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3301464Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3301719Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3301922Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3302118Z %425 = arith.addi %424, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3302340Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3302571Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3302821Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3303128Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3303374Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3303616Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3303914Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3304262Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3304569Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3304820Z %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3305067Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3305310Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3305568Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3305796Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3306075Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3306408Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3306898Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3307279Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:09.3307428Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:53:09.3307598Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:53:09.3307904Z %446 = ttg.memdesc_index %44[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3308316Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3308776Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, 
mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3309116Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:09.3309318Z %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3309688Z %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3310143Z %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3310579Z %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3310900Z %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3311189Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3311433Z %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3311722Z %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3312020Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3312352Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3312761Z %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3313116Z %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3313473Z %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3313913Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 
2026-02-21T09:53:09.3314445Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3314872Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3315239Z %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3315616Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3315966Z %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3316312Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3316645Z %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3317030Z %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3317523Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3318249Z %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3318860Z %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3319361Z %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3320050Z %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3320638Z %86 = tt.expand_dims %83 {axis = 1 
: i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3321008Z %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3321297Z %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3321571Z %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3321864Z %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3322159Z %91 = tt.load %90 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3322517Z %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3323003Z %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3323356Z %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3323711Z %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3324147Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3324685Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3325113Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3325471Z %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3325835Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3326193Z %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3326571Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> 
tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3326915Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3327302Z %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3327832Z %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3328562Z %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3329151Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3329481Z %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:09.3329730Z %108 = arith.extsi %33 : i32 to i64 2026-02-21T09:53:09.3329907Z %109 = arith.extsi %36 : i32 to i64 2026-02-21T09:53:09.3330153Z %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3330474Z %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3330888Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3331251Z %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3331522Z %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3331867Z %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3332187Z %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3332603Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> 
tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3333004Z %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3333284Z %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3333576Z %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3333882Z %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3334115Z %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3334351Z %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma> 2026-02-21T09:53:09.3334625Z %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3334907Z %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3335157Z %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3335392Z %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma> 2026-02-21T09:53:09.3335655Z %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3335951Z %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3336191Z tt.store %120, %107, %129 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:09.3336424Z %130 = arith.addi %arg3, %c9728_i32 : i32 2026-02-21T09:53:09.3336611Z %131 = arith.divsi %130, %c256_i32 : i32 2026-02-21T09:53:09.3336791Z %132 = arith.muli %131, %c8_i32 : i32 2026-02-21T09:53:09.3336972Z %133 = arith.subi %c128_i32, %132 : i32 2026-02-21T09:53:09.3337148Z %134 = arith.minsi %133, %c8_i32 : i32 2026-02-21T09:53:09.3337332Z %135 = arith.remsi %130, %c256_i32 : i32 2026-02-21T09:53:09.3337507Z %136 = arith.remsi %135, %134 : i32 2026-02-21T09:53:09.3337706Z %137 = arith.addi %132, %136 : i32 2026-02-21T09:53:09.3337875Z %138 = arith.divsi %135, %134 : i32 2026-02-21T09:53:09.3338051Z %139 = arith.muli %137, %c128_i32 : i32 2026-02-21T09:53:09.3338312Z %140 = tt.splat %139 : i32 -> 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3338656Z %141 = arith.addi %140, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3338886Z %142 = arith.muli %138, %c256_i32 : i32 2026-02-21T09:53:09.3339177Z %143 = tt.splat %142 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3339519Z %144 = arith.addi %143, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3339928Z %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3340321Z %146 = arith.muli %145, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3340626Z %147 = tt.broadcast %146 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3341062Z %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:53:09.3341503Z %149 = tt.broadcast %148 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3341841Z %150 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3342127Z %151 = arith.addi %147, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3342432Z %152 = tt.addptr %7, %151 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3342763Z %153 = tt.load %152 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3343205Z %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3343770Z ttg.local_store %153, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3344139Z %155 = arith.addi %147, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3344445Z %156 = 
tt.addptr %7, %155 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3344755Z %157 = tt.load %156 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3345192Z %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3345754Z ttg.local_store %157, %158 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3346570Z %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:53:09.3347313Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3347679Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3347913Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:09.3348097Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:53:09.3348349Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3348688Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3349107Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3349554Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3349855Z %417 = arith.addi %147, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3350165Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3350483Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 
2026-02-21T09:53:09.3350969Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3351649Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3352249Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3352628Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3352923Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3353218Z %425 = arith.addi %424, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3353528Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3353841Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3354219Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3354656Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3355043Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3355413Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3355866Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3356389Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3356833Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> 
tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3357207Z %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3357579Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3357948Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3358346Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3358696Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3359089Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3359473Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3359951Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3360300Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:09.3360432Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:53:09.3360567Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:53:09.3360856Z %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3361219Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3361751Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3362072Z } 
{tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:09.3362352Z %160 = ttg.local_load %159#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3362958Z %161 = arith.extf %160 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3363275Z %162 = arith.addi %64, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3363478Z %163 = tt.addptr %8, %162 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3363678Z %164 = tt.load %163 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3363925Z %165 = ttg.convert_layout %164 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3364208Z %166 = arith.shli %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3364446Z %167 = arith.shrsi %166, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3364685Z %168 = arith.shrsi %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3364999Z %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3365339Z %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3365705Z %171 = tt.broadcast %169 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3365977Z %172 = arith.select %13, %171, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3366218Z %173 = tt.broadcast %170 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3366449Z %174 = arith.select %15, %173, 
%172 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3366682Z %175 = tt.reshape %174 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3366906Z %176 = arith.sitofp %175 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3367160Z %177 = ttg.local_alloc %176 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3367485Z %178 = ttg.local_load %177 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3367953Z %179 = tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3368469Z %180 = ttg.local_load %159#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3368901Z %181 = arith.extf %180 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3369196Z %182 = arith.addi %88, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3369416Z %183 = tt.addptr %8, %182 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3369614Z %184 = tt.load %183 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3369859Z %185 = ttg.convert_layout %184 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3370144Z %186 = arith.shli %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3370395Z %187 = arith.shrsi %186, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3370632Z %188 = arith.shrsi %185, 
%cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3370922Z %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3371258Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3371554Z %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3371795Z %192 = arith.select %13, %191, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3372037Z %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3372268Z %194 = arith.select %15, %193, %192 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3372501Z %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3372729Z %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3372999Z %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3373322Z %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3373794Z %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3374177Z ttg.local_dealloc %150 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3374396Z %200 = arith.truncf %199 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:09.3374568Z %201 = arith.extsi %139 : i32 to i64 2026-02-21T09:53:09.3374687Z 
%202 = arith.extsi %142 : i32 to i64 2026-02-21T09:53:09.3374849Z %203 = tt.splat %201 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3375066Z %204 = arith.addi %203, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3375333Z %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3375574Z %206 = arith.muli %205, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3375757Z %207 = tt.broadcast %206 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3375985Z %208 = tt.splat %202 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3376196Z %209 = arith.addi %208, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3376461Z %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3376720Z %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3384843Z %212 = arith.addi %207, %211 : tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3385065Z %213 = tt.addptr %16, %212 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3385317Z %214 = arith.cmpi sge, %205, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3385480Z %215 = arith.cmpi slt, %205, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3385638Z %216 = arith.andi %214, %215 : tensor<128x1xi1, #mma> 2026-02-21T09:53:09.3385815Z %217 = tt.broadcast %216 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3386001Z %218 = arith.cmpi sge, %210, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3386183Z %219 = arith.cmpi slt, %210, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3386337Z %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma> 2026-02-21T09:53:09.3386510Z %221 = tt.broadcast %220 : 
tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3386691Z %222 = arith.andi %217, %221 : tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3386854Z tt.store %213, %200, %222 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:09.3387008Z %223 = arith.addi %arg3, %c19456_i32 : i32 2026-02-21T09:53:09.3387138Z %224 = arith.divsi %223, %c256_i32 : i32 2026-02-21T09:53:09.3387259Z %225 = arith.muli %224, %c8_i32 : i32 2026-02-21T09:53:09.3387378Z %226 = arith.subi %c128_i32, %225 : i32 2026-02-21T09:53:09.3387499Z %227 = arith.minsi %226, %c8_i32 : i32 2026-02-21T09:53:09.3387615Z %228 = arith.remsi %223, %c256_i32 : i32 2026-02-21T09:53:09.3387733Z %229 = arith.remsi %228, %227 : i32 2026-02-21T09:53:09.3387849Z %230 = arith.addi %225, %229 : i32 2026-02-21T09:53:09.3387964Z %231 = arith.divsi %228, %227 : i32 2026-02-21T09:53:09.3388079Z %232 = arith.muli %230, %c128_i32 : i32 2026-02-21T09:53:09.3388253Z %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3388499Z %234 = arith.addi %233, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3388671Z %235 = arith.muli %231, %c256_i32 : i32 2026-02-21T09:53:09.3388842Z %236 = tt.splat %235 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3389062Z %237 = arith.addi %236, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3389346Z %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3389602Z %239 = arith.muli %238, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3389801Z %240 = tt.broadcast %239 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3390084Z %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 
2026-02-21T09:53:09.3390369Z %242 = tt.broadcast %241 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3390591Z %243 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3390781Z %244 = arith.addi %240, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3390979Z %245 = tt.addptr %7, %244 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3391204Z %246 = tt.load %245 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3391488Z %247 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3391853Z ttg.local_store %246, %247 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3392096Z %248 = arith.addi %240, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3392294Z %249 = tt.addptr %7, %248 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3392498Z %250 = tt.load %249 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3392795Z %251 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3393154Z ttg.local_store %250, %251 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3393697Z %252:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %247, %arg8 = %251) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:53:09.3394174Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3394403Z %410 = arith.addi %409, %5 : tensor<2xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3394579Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:09.3394705Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:53:09.3394874Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3395090Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3395366Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3395642Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3395840Z %417 = arith.addi %240, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3396042Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3396260Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3396563Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3397002Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3397383Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3397633Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3397827Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3398021Z %425 = arith.addi %424, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3398220Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3398421Z %427 = tt.load %426 : 
tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3398668Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3398949Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3399203Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3399441Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3399733Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3400078Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3400363Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3400625Z %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3400869Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3401103Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3401334Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3401572Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3401828Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3402156Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3402705Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : 
tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3403065Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:09.3403194Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:53:09.3403329Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:53:09.3403605Z %446 = ttg.memdesc_index %243[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3403969Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3404403Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3404711Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:09.3404989Z %253 = ttg.local_load %252#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3405423Z %254 = arith.extf %253 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3405722Z %255 = arith.addi %64, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3405922Z %256 = tt.addptr %8, %255 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3406125Z %257 = tt.load %256 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3406371Z %258 = ttg.convert_layout %257 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3406653Z %259 = arith.shli %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3406889Z %260 = 
arith.shrsi %259, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3407131Z %261 = arith.shrsi %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3407449Z %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3407784Z %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3408070Z %264 = tt.broadcast %262 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3408312Z %265 = arith.select %13, %264, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3408557Z %266 = tt.broadcast %263 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3408813Z %267 = arith.select %15, %266, %265 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3409044Z %268 = tt.reshape %267 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3409269Z %269 = arith.sitofp %268 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3409520Z %270 = ttg.local_alloc %269 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3409862Z %271 = ttg.local_load %270 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3410339Z %272 = tt.dot %254, %271, %252#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3410837Z %273 = ttg.local_load %252#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent 
= #mma, kWidth = 2}>> 2026-02-21T09:53:09.3411270Z %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3411573Z %275 = arith.addi %88, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3411771Z %276 = tt.addptr %8, %275 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3411971Z %277 = tt.load %276 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3412211Z %278 = ttg.convert_layout %277 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3412516Z %279 = arith.shli %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3412753Z %280 = arith.shrsi %279, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3412989Z %281 = arith.shrsi %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3413278Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3413613Z %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3413900Z %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3414140Z %285 = arith.select %13, %284, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3414378Z %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3414611Z %287 = arith.select %15, %286, %285 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3414844Z %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3415070Z 
%289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3415340Z %290 = ttg.local_alloc %289 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3415663Z %291 = ttg.local_load %290 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3416137Z %292 = tt.dot %274, %291, %272, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3416528Z ttg.local_dealloc %243 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3416761Z %293 = arith.truncf %292 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:09.3416936Z %294 = arith.extsi %232 : i32 to i64 2026-02-21T09:53:09.3417054Z %295 = arith.extsi %235 : i32 to i64 2026-02-21T09:53:09.3417222Z %296 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3417435Z %297 = arith.addi %296, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3417717Z %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3417958Z %299 = arith.muli %298, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3418137Z %300 = tt.broadcast %299 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3418350Z %301 = tt.splat %295 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3418558Z %302 = arith.addi %301, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3418829Z %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3419093Z %304 = 
tt.broadcast %303 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3419275Z %305 = arith.addi %300, %304 : tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3419468Z %306 = tt.addptr %16, %305 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3419671Z %307 = arith.cmpi sge, %298, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3419839Z %308 = arith.cmpi slt, %298, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3419999Z %309 = arith.andi %307, %308 : tensor<128x1xi1, #mma> 2026-02-21T09:53:09.3420188Z %310 = tt.broadcast %309 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3420374Z %311 = arith.cmpi sge, %303, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3420535Z %312 = arith.cmpi slt, %303, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3420691Z %313 = arith.andi %311, %312 : tensor<1x256xi1, #mma> 2026-02-21T09:53:09.3420862Z %314 = tt.broadcast %313 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3421042Z %315 = arith.andi %310, %314 : tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3421204Z tt.store %306, %293, %315 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:09.3421355Z %316 = arith.addi %arg3, %c29184_i32 : i32 2026-02-21T09:53:09.3421485Z %317 = arith.divsi %316, %c256_i32 : i32 2026-02-21T09:53:09.3421605Z %318 = arith.muli %317, %c8_i32 : i32 2026-02-21T09:53:09.3421724Z %319 = arith.subi %c128_i32, %318 : i32 2026-02-21T09:53:09.3421841Z %320 = arith.minsi %319, %c8_i32 : i32 2026-02-21T09:53:09.3421964Z %321 = arith.remsi %316, %c256_i32 : i32 2026-02-21T09:53:09.3422081Z %322 = arith.remsi %321, %320 : i32 2026-02-21T09:53:09.3422195Z %323 = arith.addi %318, %322 : i32 2026-02-21T09:53:09.3422311Z %324 = arith.divsi %321, %320 : i32 2026-02-21T09:53:09.3422425Z %325 = arith.muli %323, %c128_i32 : i32 2026-02-21T09:53:09.3422598Z %326 = tt.splat %325 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:53:09.3422838Z %327 = arith.addi %326, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3423012Z %328 = arith.muli %324, %c256_i32 : i32 2026-02-21T09:53:09.3423180Z %329 = tt.splat %328 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3423400Z %330 = arith.addi %329, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3423681Z %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3423933Z %332 = arith.muli %331, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3424146Z %333 = tt.broadcast %332 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3424428Z %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:53:09.3424708Z %335 = tt.broadcast %334 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3424928Z %336 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3425127Z %337 = arith.addi %333, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3425327Z %338 = tt.addptr %7, %337 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3425536Z %339 = tt.load %338 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3425822Z %340 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3426187Z ttg.local_store %339, %340 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3426427Z %341 = arith.addi %333, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3426624Z %342 = tt.addptr %7, %341 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3426824Z %343 = tt.load %342 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3427103Z %344 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3427460Z ttg.local_store %343, %344 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3428010Z %345:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %340, %arg8 = %344) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:53:09.3428487Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3428711Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3428884Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:09.3429006Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:53:09.3429173Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3429388Z %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3429661Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3429937Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3430127Z %417 = arith.addi %333, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3430326Z %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3430542Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3430841Z %420 = ttg.local_load %arg7 : 
!ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3431277Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3431658Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3431917Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3432108Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3432301Z %425 = arith.addi %424, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3432499Z %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3432697Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3432956Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3433237Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3433471Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3433705Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3433995Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3434334Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3434618Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3434860Z %435 = 
arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3435098Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3435347Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3435576Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3435802Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3436055Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3436385Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3436859Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3437205Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:09.3437331Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:53:09.3437462Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:53:09.3437727Z %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3438090Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3438501Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3438804Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:09.3439078Z %346 = 
ttg.local_load %345#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3439514Z %347 = arith.extf %346 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3439968Z %348 = arith.addi %64, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3440162Z %349 = tt.addptr %8, %348 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3440359Z %350 = tt.load %349 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3440600Z %351 = ttg.convert_layout %350 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3440897Z %352 = arith.shli %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3441131Z %353 = arith.shrsi %352, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3441365Z %354 = arith.shrsi %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3441656Z %355 = tt.expand_dims %353 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3441991Z %356 = tt.expand_dims %354 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3442273Z %357 = tt.broadcast %355 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3442512Z %358 = arith.select %13, %357, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3442810Z %359 = tt.broadcast %356 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3443045Z %360 = arith.select %15, %359, %358 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 
2026-02-21T09:53:09.3443272Z %361 = tt.reshape %360 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3443517Z %362 = arith.sitofp %361 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3443770Z %363 = ttg.local_alloc %362 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3444095Z %364 = ttg.local_load %363 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3444564Z %365 = tt.dot %347, %364, %345#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3445063Z %366 = ttg.local_load %345#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3445491Z %367 = arith.extf %366 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3445787Z %368 = arith.addi %88, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3445982Z %369 = tt.addptr %8, %368 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3446179Z %370 = tt.load %369 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3446420Z %371 = ttg.convert_layout %370 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3446714Z %372 = arith.shli %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3446947Z %373 = arith.shrsi %372, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3447181Z %374 = arith.shrsi %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:53:09.3447468Z %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3447804Z %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3448104Z %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3448344Z %378 = arith.select %13, %377, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3448580Z %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3448808Z %380 = arith.select %15, %379, %378 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3449055Z %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3449276Z %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3449527Z %383 = ttg.local_alloc %382 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3449851Z %384 = ttg.local_load %383 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3450319Z %385 = tt.dot %367, %384, %365, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3450701Z ttg.local_dealloc %336 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3450916Z %386 = arith.truncf %385 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:09.3451085Z %387 = arith.extsi %325 : i32 to i64 2026-02-21T09:53:09.3451201Z %388 = arith.extsi %328 : i32 to i64 
2026-02-21T09:53:09.3451360Z %389 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3451587Z %390 = arith.addi %389, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3451849Z %391 = tt.expand_dims %390 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3452093Z %392 = arith.muli %391, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3452272Z %393 = tt.broadcast %392 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3452477Z %394 = tt.splat %388 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3452682Z %395 = arith.addi %394, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3452944Z %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3453204Z %397 = tt.broadcast %396 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3453386Z %398 = arith.addi %393, %397 : tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3453576Z %399 = tt.addptr %16, %398 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3453778Z %400 = arith.cmpi sge, %391, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3453938Z %401 = arith.cmpi slt, %391, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3454096Z %402 = arith.andi %400, %401 : tensor<128x1xi1, #mma> 2026-02-21T09:53:09.3454284Z %403 = tt.broadcast %402 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3454465Z %404 = arith.cmpi sge, %396, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3454624Z %405 = arith.cmpi slt, %396, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3454774Z %406 = arith.andi %404, %405 : tensor<1x256xi1, #mma> 2026-02-21T09:53:09.3454944Z %407 = tt.broadcast %406 : tensor<1x256xi1, #mma> -> 
tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3455120Z %408 = arith.andi %403, %407 : tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3455280Z tt.store %399, %386, %408 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:09.3455443Z } {tt.num_stages = 1 : i32} 2026-02-21T09:53:09.3455571Z scf.for %arg3 = %24 to %c4096_i32 step %c9728_i32 : i32 { 2026-02-21T09:53:09.3455713Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:53:09.3455832Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:53:09.3455950Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:53:09.3456063Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:53:09.3456180Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:53:09.3456294Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:53:09.3456420Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:53:09.3456528Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:53:09.3456637Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:53:09.3456801Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3457021Z %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:09.3457190Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:53:09.3457350Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3457565Z %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:09.3457840Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3458087Z %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:09.3458278Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3458549Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 
2026-02-21T09:53:09.3458838Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3459055Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3459317Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3459579Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3459765Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3459959Z %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3460158Z %49 = tt.load %48 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3460435Z %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3460793Z ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3461063Z %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3461333Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3461613Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3461795Z %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3461986Z %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3462181Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3462458Z %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3462812Z 
ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3463346Z %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:53:09.3463818Z %130 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3464061Z %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3464233Z %132 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:09.3464354Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:53:09.3464518Z %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3464736Z %135 = arith.addi %134, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:09.3465009Z %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:09.3465283Z %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3465476Z %138 = arith.addi %41, %137 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3465672Z %139 = tt.addptr %7, %138 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:09.3465877Z %140 = tt.load %139 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:09.3466176Z %141 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3466633Z %142 = arith.extf %141 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:53:09.3467013Z %143 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3467257Z %144 = arith.muli %143, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3467449Z %145 = tt.broadcast %144 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3467639Z %146 = arith.addi %145, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3467834Z %147 = tt.addptr %8, %146 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3468034Z %148 = tt.load %147 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3468275Z %149 = ttg.convert_layout %148 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3468559Z %150 = arith.shli %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3468793Z %151 = arith.shrsi %150, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3469026Z %152 = arith.shrsi %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3469319Z %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3469674Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3469957Z %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3470200Z %156 = arith.select %13, %155, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3470439Z %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3470675Z %158 = arith.select %15, %157, %156 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 
2026-02-21T09:53:09.3470930Z %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3471153Z %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3471405Z %161 = ttg.local_alloc %160 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3471744Z %162 = ttg.local_load %161 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3472220Z %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3472565Z %164 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:09.3472691Z %165 = arith.cmpi slt, %164, %c2_i32 : i32 2026-02-21T09:53:09.3472822Z %166 = arith.select %165, %164, %c0_i32 : i32 2026-02-21T09:53:09.3473083Z %167 = ttg.memdesc_index %44[%166] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3473442Z ttg.local_store %140, %167 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3473839Z scf.yield %163, %166, %arg8, %167 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:09.3474143Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:09.3474334Z %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3474655Z %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3475082Z %61 = arith.extf %60 : 
tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3475460Z %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3475700Z %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3475891Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3476078Z %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3476268Z %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3476461Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3476693Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3476968Z %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3477194Z %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3477441Z %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3477723Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3478054Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3478335Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3478569Z %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3478821Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 
2026-02-21T09:53:09.3479046Z %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3479269Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3479488Z %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3479754Z %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3480075Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3480539Z %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3480927Z %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:09.3481250Z %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3481676Z %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3482058Z %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3482302Z %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:53:09.3482504Z %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3482754Z %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:53:09.3482945Z %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 
2026-02-21T09:53:09.3483135Z %91 = tt.load %90 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:53:09.3483372Z %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3483644Z %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3483878Z %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3484110Z %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:09.3484395Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3484732Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:53:09.3485012Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3485253Z %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3485511Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3485739Z %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:53:09.3485971Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:09.3486195Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:09.3486455Z %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:53:09.3486806Z %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:09.3487272Z %106 = tt.dot %85, %105, 
%82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:09.3487659Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:09.3487906Z %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:09.3488080Z %108 = arith.extsi %33 : i32 to i64 2026-02-21T09:53:09.3488204Z %109 = arith.extsi %36 : i32 to i64 2026-02-21T09:53:09.3488370Z %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3488587Z %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:09.3488857Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3489103Z %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3489290Z %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3489499Z %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3489713Z %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:09.3489979Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3490262Z %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3490446Z %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3490639Z %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:09.3490844Z %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:53:09.3491009Z %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma> 
2026-02-21T09:53:09.3491171Z %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma> 2026-02-21T09:53:09.3491349Z %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3491540Z %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3491705Z %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:53:09.3491858Z %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma> 2026-02-21T09:53:09.3492033Z %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3492214Z %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma> 2026-02-21T09:53:09.3492376Z tt.store %120, %107, %129 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:09.3492526Z } {tt.num_stages = 1 : i32} 2026-02-21T09:53:09.3492629Z tt.return 2026-02-21T09:53:09.3492712Z } 2026-02-21T09:53:09.3492788Z } 2026-02-21T09:53:09.3492836Z 2026-02-21T09:53:09.3492867Z {-# 2026-02-21T09:53:09.3492967Z external_resources: { 2026-02-21T09:53:09.3493073Z mlir_reproducer: { 2026-02-21T09:53:09.3494067Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:53:09.3495071Z disable_threading: false, 2026-02-21T09:53:09.3495179Z verify_each: true 2026-02-21T09:53:09.3495274Z } 2026-02-21T09:53:09.3495345Z } 2026-02-21T09:53:09.3495418Z #-} 2026-02-21T09:53:09.3495700Z 
/tmp/torchinductor_root/xb/cxbuks2owjyp7fxrahwxzjlcklwajtufzfy3vea6edykq37ttp3p.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:53:09.3496415Z /tmp/torchinductor_root/xb/cxbuks2owjyp7fxrahwxzjlcklwajtufzfy3vea6edykq37ttp3p.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:53:09.3496962Z [520s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:53:09.3497742Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:53:09.3498459Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:53:09.3498628Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:53:10.1224450Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:53:10.1235607Z #blocked = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:53:10.1236312Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:53:10.1237021Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:53:10.1237887Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:53:10.1238645Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:53:10.1239338Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:53:10.1239834Z #smem = #ttg.shared_memory 2026-02-21T09:53:10.1240438Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:53:10.1241775Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:53:10.1242501Z %cst = arith.constant dense<4> : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1242939Z %c19456_i32 = arith.constant 19456 : i32 2026-02-21T09:53:10.1243184Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:53:10.1243466Z %cst_0 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1243810Z %c58368_i32 = arith.constant 58368 : i32 2026-02-21T09:53:10.1244027Z %c38912_i32 = arith.constant 38912 : i32 2026-02-21T09:53:10.1244233Z %c77824_i32 = arith.constant 77824 : i32 2026-02-21T09:53:10.1244443Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:53:10.1244644Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:53:10.1244845Z %c8_i32 = 
arith.constant 8 : i32 2026-02-21T09:53:10.1245063Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:53:10.1245264Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:53:10.1245458Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:53:10.1245758Z %cst_1 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1246073Z %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1246429Z %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1246703Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:53:10.1246960Z %cst_4 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1247227Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:53:10.1247422Z %c23551_i32 = arith.constant 23551 : i32 2026-02-21T09:53:10.1247641Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1247913Z %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:53:10.1248153Z %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:53:10.1248393Z %cst_8 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1248626Z %cst_9 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1248851Z %cst_10 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1249082Z %cst_11 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1249309Z %cst_12 = arith.constant dense<8192> : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1249515Z %0 = tt.get_program_id x : i32 2026-02-21T09:53:10.1249791Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1250179Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1250544Z %3 = tt.make_range {end = 256 : i32, start 
= 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1250943Z %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1251318Z %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1251679Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1252008Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1252277Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1252647Z %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:53:10.1253229Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:53:10.1253783Z %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:53:10.1254131Z %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:53:10.1254428Z %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 2026-02-21T09:53:10.1254704Z %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:53:10.1254993Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 2026-02-21T09:53:10.1255280Z %16 = tt.splat %arg2 : !tt.ptr -> tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:10.1255654Z %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1256119Z %18 = arith.extsi %3 : tensor<256xi32, #ttg.slice<{dim = 0, 
parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1256437Z %19 = arith.subi %c23551_i32, %0 : i32 2026-02-21T09:53:10.1256630Z %20 = arith.divui %19, %c19456_i32 : i32 2026-02-21T09:53:10.1256792Z %21 = arith.remsi %20, %c4_i32 : i32 2026-02-21T09:53:10.1256954Z %22 = arith.subi %20, %21 : i32 2026-02-21T09:53:10.1257106Z %23 = arith.muli %22, %c19456_i32 : i32 2026-02-21T09:53:10.1257264Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:53:10.1257418Z scf.for %arg3 = %0 to %24 step %c77824_i32 : i32 { 2026-02-21T09:53:10.1257572Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:53:10.1257704Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:53:10.1257852Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:53:10.1257983Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:53:10.1258112Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:53:10.1258243Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:53:10.1258366Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:53:10.1258488Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:53:10.1258609Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:53:10.1258795Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1259037Z %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1259221Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:53:10.1259404Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1259635Z %38 = arith.addi %37, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1259935Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1260207Z %40 = arith.muli %39, %cst_1 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1260438Z %41 = tt.broadcast %40 : tensor<128x1xi32, 
#blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1260741Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked> 2026-02-21T09:53:10.1261036Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1261287Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1261613Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1261908Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1262113Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1262323Z %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1262543Z %49 = tt.load %48 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1262798Z %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1263055Z %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1263252Z %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1263469Z %53 = arith.addi %52, %43 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1263674Z %54 = tt.addptr %8, %53 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1263881Z %55 = tt.load %54 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1264184Z %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1264573Z ttg.local_store %49, %56 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1264863Z %57 = arith.addi 
%5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1265119Z %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1265412Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1265706Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1265911Z %61 = arith.addi %41, %60 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1266138Z %62 = tt.addptr %7, %61 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1266357Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1266593Z %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1266830Z %65 = arith.muli %64, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1267014Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1267195Z %67 = arith.addi %66, %43 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1267382Z %68 = tt.addptr %8, %67 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1267569Z %69 = tt.load %68 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1267845Z %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1268199Z ttg.local_store %63, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1268839Z %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, 
!ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:53:10.1269360Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:10.1269535Z %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1269754Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1269929Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:53:10.1270097Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1270317Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1270590Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1270868Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1271072Z %416 = arith.addi %41, %415 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1271274Z %417 = tt.addptr %7, %416 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1271497Z %418 = tt.load %417 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1271802Z %419 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1272361Z %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1272747Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1272992Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1273214Z %423 = tt.broadcast %422 : 
tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1273408Z %424 = arith.addi %423, %43 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1273601Z %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1273804Z %426 = tt.load %425 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1273962Z %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1274139Z %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1274384Z %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1274637Z %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1274882Z %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1275228Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1275579Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1275871Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1276124Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1276373Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1276631Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1276867Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1277097Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 
2026-02-21T09:53:10.1277397Z %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1277876Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1278225Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:10.1278357Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:53:10.1278493Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:53:10.1278762Z %445 = ttg.memdesc_index %44[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1279125Z ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1279612Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1280019Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:10.1280299Z %72 = ttg.local_load %71#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1280725Z %73 = arith.extf %72 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1281023Z %74 = arith.shli %71#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1281202Z %75 = arith.shrsi %74, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1281438Z %76 = ttg.convert_layout %75 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1281681Z %77 = arith.shrsi %71#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1281918Z %78 = ttg.convert_layout %77 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1282265Z %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1282655Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1282941Z %81 = tt.broadcast %79 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1283182Z %82 = arith.select %13, %81, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1283421Z %83 = tt.broadcast %80 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1283655Z %84 = arith.select %15, %83, %82 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1283886Z %85 = tt.reshape %84 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1284105Z %86 = arith.sitofp %85 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1284397Z %87 = ttg.convert_layout %86 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1284877Z %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1285368Z %89 = ttg.local_load %71#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1285793Z %90 = arith.extf %89 : 
tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1286089Z %91 = arith.shli %71#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1286251Z %92 = arith.shrsi %91, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1286494Z %93 = ttg.convert_layout %92 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1286734Z %94 = arith.shrsi %71#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1286978Z %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1287307Z %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1287647Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1287946Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1288183Z %99 = arith.select %13, %98, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1288425Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1288662Z %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1288904Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1289133Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1289448Z %104 = ttg.convert_layout %103 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1289911Z %105 = tt.dot %90, %104, %88, inputPrecision 
= tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1290293Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1290528Z %106 = arith.truncf %105 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:10.1290703Z %107 = arith.extsi %33 : i32 to i64 2026-02-21T09:53:10.1290822Z %108 = arith.extsi %36 : i32 to i64 2026-02-21T09:53:10.1290988Z %109 = tt.splat %107 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1291199Z %110 = arith.addi %109, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1291469Z %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1291711Z %112 = arith.muli %111, %cst_8 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1291890Z %113 = tt.broadcast %112 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1292102Z %114 = tt.splat %108 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1292311Z %115 = arith.addi %114, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1292579Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1292857Z %117 = tt.broadcast %116 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1293039Z %118 = arith.addi %113, %117 : tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1293263Z %119 = tt.addptr %16, %118 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1293465Z %120 = arith.cmpi sge, %111, %cst_9 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1293634Z %121 = arith.cmpi slt, %111, %cst_10 : tensor<128x1xi64, #mma> 
2026-02-21T09:53:10.1293798Z %122 = arith.andi %120, %121 : tensor<128x1xi1, #mma> 2026-02-21T09:53:10.1293973Z %123 = tt.broadcast %122 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1294163Z %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1294327Z %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1294486Z %126 = arith.andi %124, %125 : tensor<1x256xi1, #mma> 2026-02-21T09:53:10.1294659Z %127 = tt.broadcast %126 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1294840Z %128 = arith.andi %123, %127 : tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1295007Z tt.store %119, %106, %128 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:10.1295160Z %129 = arith.addi %arg3, %c19456_i32 : i32 2026-02-21T09:53:10.1295291Z %130 = arith.divsi %129, %c256_i32 : i32 2026-02-21T09:53:10.1295412Z %131 = arith.muli %130, %c8_i32 : i32 2026-02-21T09:53:10.1295548Z %132 = arith.subi %c128_i32, %131 : i32 2026-02-21T09:53:10.1295667Z %133 = arith.minsi %132, %c8_i32 : i32 2026-02-21T09:53:10.1295787Z %134 = arith.remsi %129, %c256_i32 : i32 2026-02-21T09:53:10.1295906Z %135 = arith.remsi %134, %133 : i32 2026-02-21T09:53:10.1296020Z %136 = arith.addi %131, %135 : i32 2026-02-21T09:53:10.1296136Z %137 = arith.divsi %134, %133 : i32 2026-02-21T09:53:10.1296251Z %138 = arith.muli %136, %c128_i32 : i32 2026-02-21T09:53:10.1296425Z %139 = tt.splat %138 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1296652Z %140 = arith.addi %139, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1296846Z %141 = arith.muli %137, %c256_i32 : i32 2026-02-21T09:53:10.1297016Z %142 = tt.splat %141 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1297235Z %143 = arith.addi %142, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1297517Z %144 = 
tt.expand_dims %140 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1297785Z %145 = arith.muli %144, %cst_1 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1297983Z %146 = tt.broadcast %145 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1298266Z %147 = tt.expand_dims %143 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked> 2026-02-21T09:53:10.1298544Z %148 = tt.broadcast %147 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1298768Z %149 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1298956Z %150 = arith.addi %146, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1299159Z %151 = tt.addptr %7, %150 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1299368Z %152 = tt.load %151 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1299528Z %153 = arith.addi %52, %148 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1299725Z %154 = tt.addptr %8, %153 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1299923Z %155 = tt.load %154 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1300226Z %156 = ttg.memdesc_index %149[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1300591Z ttg.local_store %152, %156 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1300834Z %157 = arith.addi %146, %60 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1301037Z %158 = tt.addptr %7, %157 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1301244Z %159 = tt.load %158 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1301405Z %160 = arith.addi %66, %148 : 
tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1301597Z %161 = tt.addptr %8, %160 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1301795Z %162 = tt.load %161 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1302078Z %163 = ttg.memdesc_index %149[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1302436Z ttg.local_store %159, %163 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1303073Z %164:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %156, %arg8 = %163, %arg9 = %155, %arg10 = %162) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:53:10.1303621Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:10.1303795Z %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1304017Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1304190Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:53:10.1304362Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1304601Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1304877Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1305156Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1305353Z %416 = arith.addi %146, %415 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1305587Z %417 = tt.addptr %7, %416 : 
tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1305796Z %418 = tt.load %417 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1306100Z %419 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1306540Z %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1306922Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1307172Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1307366Z %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1307558Z %424 = arith.addi %423, %148 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1307758Z %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1307958Z %426 = tt.load %425 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1308138Z %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1308300Z %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1308545Z %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1308796Z %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1309039Z %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1309381Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1309728Z %433 = tt.expand_dims %431 {axis = 1 : i32} 
: tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1310017Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1310269Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1310514Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1310757Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1310996Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1311344Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1311644Z %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1312117Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1312464Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:10.1312613Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:53:10.1312748Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:53:10.1313018Z %445 = ttg.memdesc_index %149[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1313381Z ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1313879Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, 
#smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1314271Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:10.1314551Z %165 = ttg.local_load %164#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1321217Z %166 = arith.extf %165 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1321555Z %167 = arith.shli %164#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1321725Z %168 = arith.shrsi %167, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1321970Z %169 = ttg.convert_layout %168 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1322221Z %170 = arith.shrsi %164#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1322462Z %171 = ttg.convert_layout %170 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1322923Z %172 = tt.expand_dims %169 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1323270Z %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1323560Z %174 = tt.broadcast %172 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1323812Z %175 = arith.select %13, %174, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1324057Z %176 = tt.broadcast %173 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1324299Z %177 = arith.select %15, %176, %175 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1324534Z %178 = tt.reshape %177 
: tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1324762Z %179 = arith.sitofp %178 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1325059Z %180 = ttg.convert_layout %179 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1325534Z %181 = tt.dot %166, %180, %164#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1326053Z %182 = ttg.local_load %164#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1326488Z %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1326790Z %184 = arith.shli %164#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1326952Z %185 = arith.shrsi %184, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1327196Z %186 = ttg.convert_layout %185 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1327465Z %187 = arith.shrsi %164#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1327708Z %188 = ttg.convert_layout %187 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1328044Z %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1328401Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1328691Z %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked1> -> 
tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1328937Z %192 = arith.select %13, %191, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1329181Z %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1329424Z %194 = arith.select %15, %193, %192 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1329661Z %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1329890Z %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1330188Z %197 = ttg.convert_layout %196 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1330655Z %198 = tt.dot %183, %197, %181, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1331059Z ttg.local_dealloc %149 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1331277Z %199 = arith.truncf %198 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:10.1331454Z %200 = arith.extsi %138 : i32 to i64 2026-02-21T09:53:10.1331571Z %201 = arith.extsi %141 : i32 to i64 2026-02-21T09:53:10.1331735Z %202 = tt.splat %200 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1331949Z %203 = arith.addi %202, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1332219Z %204 = tt.expand_dims %203 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1332462Z %205 = arith.muli %204, %cst_8 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1332642Z %206 = tt.broadcast %205 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 
2026-02-21T09:53:10.1332856Z %207 = tt.splat %201 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1333065Z %208 = arith.addi %207, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1333331Z %209 = tt.expand_dims %208 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1333595Z %210 = tt.broadcast %209 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1333796Z %211 = arith.addi %206, %210 : tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1333990Z %212 = tt.addptr %16, %211 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1334201Z %213 = arith.cmpi sge, %204, %cst_9 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1334369Z %214 = arith.cmpi slt, %204, %cst_10 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1334529Z %215 = arith.andi %213, %214 : tensor<128x1xi1, #mma> 2026-02-21T09:53:10.1334706Z %216 = tt.broadcast %215 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1334894Z %217 = arith.cmpi sge, %209, %cst_11 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1335078Z %218 = arith.cmpi slt, %209, %cst_12 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1335235Z %219 = arith.andi %217, %218 : tensor<1x256xi1, #mma> 2026-02-21T09:53:10.1335410Z %220 = tt.broadcast %219 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1335589Z %221 = arith.andi %216, %220 : tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1335752Z tt.store %212, %199, %221 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:10.1335905Z %222 = arith.addi %arg3, %c38912_i32 : i32 2026-02-21T09:53:10.1336052Z %223 = arith.divsi %222, %c256_i32 : i32 2026-02-21T09:53:10.1336174Z %224 = arith.muli %223, %c8_i32 : i32 2026-02-21T09:53:10.1336299Z %225 = arith.subi %c128_i32, %224 : i32 2026-02-21T09:53:10.1336419Z %226 = arith.minsi %225, %c8_i32 : i32 2026-02-21T09:53:10.1336540Z %227 = 
arith.remsi %222, %c256_i32 : i32 2026-02-21T09:53:10.1336659Z %228 = arith.remsi %227, %226 : i32 2026-02-21T09:53:10.1336775Z %229 = arith.addi %224, %228 : i32 2026-02-21T09:53:10.1336888Z %230 = arith.divsi %227, %226 : i32 2026-02-21T09:53:10.1337004Z %231 = arith.muli %229, %c128_i32 : i32 2026-02-21T09:53:10.1337178Z %232 = tt.splat %231 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1337405Z %233 = arith.addi %232, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1337580Z %234 = arith.muli %230, %c256_i32 : i32 2026-02-21T09:53:10.1337752Z %235 = tt.splat %234 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1337971Z %236 = arith.addi %235, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1338271Z %237 = tt.expand_dims %233 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1338526Z %238 = arith.muli %237, %cst_1 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1338726Z %239 = tt.broadcast %238 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1339005Z %240 = tt.expand_dims %236 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked> 2026-02-21T09:53:10.1339285Z %241 = tt.broadcast %240 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1339506Z %242 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1339694Z %243 = arith.addi %239, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1339895Z %244 = tt.addptr %7, %243 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1340102Z %245 = tt.load %244 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1340263Z %246 = arith.addi %52, %241 : tensor<2x256xi32, #blocked> 
2026-02-21T09:53:10.1340457Z %247 = tt.addptr %8, %246 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1340654Z %248 = tt.load %247 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1340937Z %249 = ttg.memdesc_index %242[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1341320Z ttg.local_store %245, %249 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1341563Z %250 = arith.addi %239, %60 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1341761Z %251 = tt.addptr %7, %250 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1341964Z %252 = tt.load %251 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1342125Z %253 = arith.addi %66, %241 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1342317Z %254 = tt.addptr %8, %253 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1342531Z %255 = tt.load %254 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1342809Z %256 = ttg.memdesc_index %242[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1343168Z ttg.local_store %252, %256 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1343819Z %257:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %249, %arg8 = %256, %arg9 = %248, %arg10 = %255) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:53:10.1344347Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:10.1344519Z %409 = tt.splat %408 : i32 -> tensor<2xi32, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1344743Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1344916Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:53:10.1345082Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1345304Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1345581Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1345862Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1346058Z %416 = arith.addi %239, %415 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1346280Z %417 = tt.addptr %7, %416 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1346490Z %418 = tt.load %417 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1346791Z %419 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1347232Z %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1347618Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1347863Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1348054Z %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1348245Z %424 = arith.addi %423, %241 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1348439Z %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 
2026-02-21T09:53:10.1348640Z %426 = tt.load %425 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1348798Z %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1348971Z %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1349214Z %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1349466Z %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1349709Z %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1350053Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1350403Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1350715Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1350964Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1351213Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1351473Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1351711Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1351936Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1352237Z %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1352707Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1353058Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:10.1353186Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:53:10.1353319Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:53:10.1353591Z %445 = ttg.memdesc_index %242[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1353953Z ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1354463Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1354856Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:10.1355135Z %258 = ttg.local_load %257#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1355569Z %259 = arith.extf %258 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1355870Z %260 = arith.shli %257#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1356030Z %261 = arith.shrsi %260, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1356275Z %262 = ttg.convert_layout %261 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1356523Z %263 = arith.shrsi %257#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1356763Z %264 = ttg.convert_layout %263 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, 
parent = #blocked1}>> 2026-02-21T09:53:10.1357100Z %265 = tt.expand_dims %262 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1357455Z %266 = tt.expand_dims %264 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1357751Z %267 = tt.broadcast %265 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1358000Z %268 = arith.select %13, %267, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1358244Z %269 = tt.broadcast %266 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1358484Z %270 = arith.select %15, %269, %268 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1358736Z %271 = tt.reshape %270 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1358966Z %272 = arith.sitofp %271 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1359266Z %273 = ttg.convert_layout %272 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1359747Z %274 = tt.dot %259, %273, %257#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1360246Z %275 = ttg.local_load %257#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1360675Z %276 = arith.extf %275 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1360978Z %277 = arith.shli %257#5, %cst : tensor<2x256xi8, #blocked> 
2026-02-21T09:53:10.1361141Z %278 = arith.shrsi %277, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1361384Z %279 = ttg.convert_layout %278 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1361633Z %280 = arith.shrsi %257#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1361877Z %281 = ttg.convert_layout %280 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1362227Z %282 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1362619Z %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1362963Z %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1363208Z %285 = arith.select %13, %284, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1363449Z %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1363689Z %287 = arith.select %15, %286, %285 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1363925Z %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1364150Z %289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1364448Z %290 = ttg.convert_layout %289 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1364908Z %291 = tt.dot %276, %290, %274, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1365294Z 
ttg.local_dealloc %242 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1365530Z %292 = arith.truncf %291 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:10.1365701Z %293 = arith.extsi %231 : i32 to i64 2026-02-21T09:53:10.1365820Z %294 = arith.extsi %234 : i32 to i64 2026-02-21T09:53:10.1365981Z %295 = tt.splat %293 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1366195Z %296 = arith.addi %295, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1366463Z %297 = tt.expand_dims %296 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1366720Z %298 = arith.muli %297, %cst_8 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1366900Z %299 = tt.broadcast %298 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1367105Z %300 = tt.splat %294 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1367313Z %301 = arith.addi %300, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1367596Z %302 = tt.expand_dims %301 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1367856Z %303 = tt.broadcast %302 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1368039Z %304 = arith.addi %299, %303 : tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1368230Z %305 = tt.addptr %16, %304 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1368432Z %306 = arith.cmpi sge, %297, %cst_9 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1368600Z %307 = arith.cmpi slt, %297, %cst_10 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1368760Z %308 = arith.andi %306, %307 : tensor<128x1xi1, #mma> 2026-02-21T09:53:10.1368935Z %309 = tt.broadcast %308 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1369120Z %310 = 
arith.cmpi sge, %302, %cst_11 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1369284Z %311 = arith.cmpi slt, %302, %cst_12 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1369439Z %312 = arith.andi %310, %311 : tensor<1x256xi1, #mma> 2026-02-21T09:53:10.1369612Z %313 = tt.broadcast %312 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1369790Z %314 = arith.andi %309, %313 : tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1369948Z tt.store %305, %292, %314 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:10.1370119Z %315 = arith.addi %arg3, %c58368_i32 : i32 2026-02-21T09:53:10.1370243Z %316 = arith.divsi %315, %c256_i32 : i32 2026-02-21T09:53:10.1370366Z %317 = arith.muli %316, %c8_i32 : i32 2026-02-21T09:53:10.1370484Z %318 = arith.subi %c128_i32, %317 : i32 2026-02-21T09:53:10.1370601Z %319 = arith.minsi %318, %c8_i32 : i32 2026-02-21T09:53:10.1370719Z %320 = arith.remsi %315, %c256_i32 : i32 2026-02-21T09:53:10.1370836Z %321 = arith.remsi %320, %319 : i32 2026-02-21T09:53:10.1370949Z %322 = arith.addi %317, %321 : i32 2026-02-21T09:53:10.1371059Z %323 = arith.divsi %320, %319 : i32 2026-02-21T09:53:10.1371176Z %324 = arith.muli %322, %c128_i32 : i32 2026-02-21T09:53:10.1371345Z %325 = tt.splat %324 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1371572Z %326 = arith.addi %325, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1371746Z %327 = arith.muli %323, %c256_i32 : i32 2026-02-21T09:53:10.1371912Z %328 = tt.splat %327 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1372132Z %329 = arith.addi %328, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1372408Z %330 = tt.expand_dims %326 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1372678Z %331 = arith.muli %330, %cst_1 : tensor<128x1xi32, #blocked2> 
2026-02-21T09:53:10.1372872Z %332 = tt.broadcast %331 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1373151Z %333 = tt.expand_dims %329 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked> 2026-02-21T09:53:10.1373509Z %334 = tt.broadcast %333 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1373775Z %335 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1373963Z %336 = arith.addi %332, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1374186Z %337 = tt.addptr %7, %336 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1374391Z %338 = tt.load %337 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1374549Z %339 = arith.addi %52, %334 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1374743Z %340 = tt.addptr %8, %339 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1374940Z %341 = tt.load %340 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1375236Z %342 = ttg.memdesc_index %335[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1375602Z ttg.local_store %338, %342 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1375842Z %343 = arith.addi %332, %60 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1376042Z %344 = tt.addptr %7, %343 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1376245Z %345 = tt.load %344 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1376399Z %346 = arith.addi %66, %334 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1376590Z %347 = tt.addptr %8, %346 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1376784Z %348 = tt.load %347 : tensor<2x256x!tt.ptr, 
#blocked> 2026-02-21T09:53:10.1377062Z %349 = ttg.memdesc_index %335[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1377420Z ttg.local_store %345, %349 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1378067Z %350:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %342, %arg8 = %349, %arg9 = %341, %arg10 = %348) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:53:10.1378594Z %408 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:10.1378768Z %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1378987Z %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1379160Z %411 = arith.muli %408, %c2_i32 : i32 2026-02-21T09:53:10.1379327Z %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1379545Z %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1379823Z %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1380101Z %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1380296Z %416 = arith.addi %332, %415 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1380497Z %417 = tt.addptr %7, %416 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1380725Z %418 = tt.load %417 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1381027Z %419 = ttg.local_load %arg7 : 
!ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1381464Z %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1381842Z %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1382100Z %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1382288Z %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1382479Z %424 = arith.addi %423, %334 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1382670Z %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1382884Z %426 = tt.load %425 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1383040Z %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1383196Z %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1383441Z %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1383686Z %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1383932Z %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1384268Z %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1384610Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1384903Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, 
#blocked1> 2026-02-21T09:53:10.1385148Z %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1385429Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1385670Z %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1385907Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1386134Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1386429Z %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1386902Z %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1387248Z %442 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:10.1387374Z %443 = arith.cmpi slt, %442, %c2_i32 : i32 2026-02-21T09:53:10.1387507Z %444 = arith.select %443, %442, %c0_i32 : i32 2026-02-21T09:53:10.1387774Z %445 = ttg.memdesc_index %335[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1388135Z ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1388620Z scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1389022Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:10.1389300Z %351 = ttg.local_load 
%350#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1389737Z %352 = arith.extf %351 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1390051Z %353 = arith.shli %350#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1390209Z %354 = arith.shrsi %353, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1390454Z %355 = ttg.convert_layout %354 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1390700Z %356 = arith.shrsi %350#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1390955Z %357 = ttg.convert_layout %356 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1391290Z %358 = tt.expand_dims %355 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1391631Z %359 = tt.expand_dims %357 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1391921Z %360 = tt.broadcast %358 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1392168Z %361 = arith.select %13, %360, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1392408Z %362 = tt.broadcast %359 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1392648Z %363 = arith.select %15, %362, %361 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1392881Z %364 = tt.reshape %363 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1393107Z %365 = arith.sitofp %364 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 
2026-02-21T09:53:10.1393405Z %366 = ttg.convert_layout %365 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1393883Z %367 = tt.dot %352, %366, %350#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1394379Z %368 = ttg.local_load %350#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1394806Z %369 = arith.extf %368 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1395105Z %370 = arith.shli %350#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1395264Z %371 = arith.shrsi %370, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1395502Z %372 = ttg.convert_layout %371 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1395747Z %373 = arith.shrsi %350#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1395986Z %374 = ttg.convert_layout %373 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1396320Z %375 = tt.expand_dims %372 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1396676Z %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1396965Z %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1397208Z %378 = arith.select %13, %377, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 
2026-02-21T09:53:10.1397450Z %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1397689Z %380 = arith.select %15, %379, %378 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1397920Z %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1398159Z %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1398454Z %383 = ttg.convert_layout %382 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1398936Z %384 = tt.dot %369, %383, %367, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1399319Z ttg.local_dealloc %335 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1399534Z %385 = arith.truncf %384 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:10.1399704Z %386 = arith.extsi %324 : i32 to i64 2026-02-21T09:53:10.1399820Z %387 = arith.extsi %327 : i32 to i64 2026-02-21T09:53:10.1399988Z %388 = tt.splat %386 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1400197Z %389 = arith.addi %388, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1400466Z %390 = tt.expand_dims %389 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1400706Z %391 = arith.muli %390, %cst_8 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1400886Z %392 = tt.broadcast %391 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1401093Z %393 = tt.splat %387 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1401299Z %394 = arith.addi %393, %18 : 
tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1401577Z %395 = tt.expand_dims %394 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1401836Z %396 = tt.broadcast %395 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1402018Z %397 = arith.addi %392, %396 : tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1402209Z %398 = tt.addptr %16, %397 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1402411Z %399 = arith.cmpi sge, %390, %cst_9 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1402616Z %400 = arith.cmpi slt, %390, %cst_10 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1402777Z %401 = arith.andi %399, %400 : tensor<128x1xi1, #mma> 2026-02-21T09:53:10.1402952Z %402 = tt.broadcast %401 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1403137Z %403 = arith.cmpi sge, %395, %cst_11 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1403300Z %404 = arith.cmpi slt, %395, %cst_12 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1403458Z %405 = arith.andi %403, %404 : tensor<1x256xi1, #mma> 2026-02-21T09:53:10.1403627Z %406 = tt.broadcast %405 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1403808Z %407 = arith.andi %402, %406 : tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1403965Z tt.store %398, %385, %407 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:10.1404111Z } {tt.num_stages = 1 : i32} 2026-02-21T09:53:10.1404259Z scf.for %arg3 = %24 to %c4096_i32 step %c19456_i32 : i32 { 2026-02-21T09:53:10.1404404Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:53:10.1404524Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:53:10.1404640Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:53:10.1404755Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:53:10.1404872Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:53:10.1404988Z %30 = arith.remsi %29, %28 : i32 
2026-02-21T09:53:10.1405098Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:53:10.1405206Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:53:10.1405315Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:53:10.1405500Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1405720Z %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:10.1405887Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:53:10.1406047Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1406274Z %38 = arith.addi %37, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:53:10.1406547Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1406794Z %40 = arith.muli %39, %cst_1 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:10.1406983Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1407254Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked> 2026-02-21T09:53:10.1407525Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1407736Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1408003Z %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1408269Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1408456Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1408651Z %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1408876Z %49 = tt.load %48 : 
tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1409110Z %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1409347Z %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1409529Z %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1409712Z %53 = arith.addi %52, %43 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1409900Z %54 = tt.addptr %8, %53 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1410090Z %55 = tt.load %54 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1410368Z %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1410732Z ttg.local_store %49, %56 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1411000Z %57 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1411223Z %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1411495Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1411774Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1411961Z %61 = arith.addi %41, %60 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1412152Z %62 = tt.addptr %7, %61 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1412351Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1412585Z %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1412819Z %65 = arith.muli %64, 
%cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1413014Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1413195Z %67 = arith.addi %66, %43 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1413385Z %68 = tt.addptr %8, %67 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1413573Z %69 = tt.load %68 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1413864Z %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1414219Z ttg.local_store %63, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1414839Z %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:53:10.1415359Z %129 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:53:10.1415533Z %130 = tt.splat %129 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1415755Z %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:10.1415932Z %132 = arith.muli %129, %c2_i32 : i32 2026-02-21T09:53:10.1416101Z %133 = tt.splat %132 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1416321Z %134 = arith.addi %133, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:10.1416611Z %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:53:10.1416888Z %136 = tt.broadcast %135 : tensor<1x4xi32, 
#blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1417084Z %137 = arith.addi %41, %136 : tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1417284Z %138 = tt.addptr %7, %137 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:53:10.1417491Z %139 = tt.load %138 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:53:10.1417796Z %140 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1418233Z %141 = arith.extf %140 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1418618Z %142 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1418861Z %143 = arith.muli %142, %cst_2 : tensor<2x1xi32, #blocked> 2026-02-21T09:53:10.1419053Z %144 = tt.broadcast %143 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1419248Z %145 = arith.addi %144, %43 : tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1419444Z %146 = tt.addptr %8, %145 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi32, #blocked> 2026-02-21T09:53:10.1419661Z %147 = tt.load %146 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:53:10.1419822Z %148 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1419985Z %149 = arith.shrsi %148, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1420240Z %150 = ttg.convert_layout %149 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1420491Z %151 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1420738Z %152 = ttg.convert_layout %151 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1421092Z %153 = tt.expand_dims 
%150 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1421443Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1421740Z %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1422010Z %156 = arith.select %13, %155, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1422258Z %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1422503Z %158 = arith.select %15, %157, %156 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1422744Z %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1422974Z %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1423271Z %161 = ttg.convert_layout %160 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1423748Z %162 = tt.dot %141, %161, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1424103Z %163 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:53:10.1424232Z %164 = arith.cmpi slt, %163, %c2_i32 : i32 2026-02-21T09:53:10.1424370Z %165 = arith.select %164, %163, %c0_i32 : i32 2026-02-21T09:53:10.1424655Z %166 = ttg.memdesc_index %44[%165] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:53:10.1425018Z ttg.local_store %139, %166 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 
2026-02-21T09:53:10.1425503Z scf.yield %162, %165, %arg8, %166, %arg10, %147 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1425894Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:53:10.1426172Z %72 = ttg.local_load %71#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1426605Z %73 = arith.extf %72 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1426901Z %74 = arith.shli %71#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1427063Z %75 = arith.shrsi %74, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1427298Z %76 = ttg.convert_layout %75 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1427541Z %77 = arith.shrsi %71#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1427695Z %78 = ttg.convert_layout %77 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1427850Z %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1427999Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1428096Z %81 = tt.broadcast %79 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1428203Z %82 = arith.select %13, %81, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1428311Z %83 = tt.broadcast %80 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 
2026-02-21T09:53:10.1428409Z %84 = arith.select %15, %83, %82 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1428498Z %85 = tt.reshape %84 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1428605Z %86 = arith.sitofp %85 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1428764Z %87 = ttg.convert_layout %86 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1429026Z %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1429222Z %89 = ttg.local_load %71#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1429415Z %90 = arith.extf %89 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1429480Z %91 = arith.shli %71#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1429539Z %92 = arith.shrsi %91, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1429681Z %93 = ttg.convert_layout %92 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1429745Z %94 = arith.shrsi %71#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:53:10.1429900Z %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:10.1430050Z %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1430200Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, 
parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:53:10.1430298Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1430402Z %99 = arith.select %13, %98, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1430500Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1430603Z %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:53:10.1430696Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:53:10.1430789Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:53:10.1430953Z %104 = ttg.convert_layout %103 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:10.1431210Z %105 = tt.dot %90, %104, %88, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:53:10.1431316Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:53:10.1431406Z %106 = arith.truncf %105 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:53:10.1431451Z %107 = arith.extsi %33 : i32 to i64 2026-02-21T09:53:10.1431493Z %108 = arith.extsi %36 : i32 to i64 2026-02-21T09:53:10.1431583Z %109 = tt.splat %107 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1431667Z %110 = arith.addi %109, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:10.1431823Z %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1431882Z %112 = arith.muli %111, %cst_8 : 
tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1431967Z %113 = tt.broadcast %112 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1432051Z %114 = tt.splat %108 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1432155Z %115 = arith.addi %114, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:10.1432293Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1432377Z %117 = tt.broadcast %116 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1432438Z %118 = arith.addi %113, %117 : tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1432536Z %119 = tt.addptr %16, %118 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:53:10.1432602Z %120 = arith.cmpi sge, %111, %cst_9 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1432672Z %121 = arith.cmpi slt, %111, %cst_10 : tensor<128x1xi64, #mma> 2026-02-21T09:53:10.1432732Z %122 = arith.andi %120, %121 : tensor<128x1xi1, #mma> 2026-02-21T09:53:10.1432814Z %123 = tt.broadcast %122 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1432881Z %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1432946Z %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x256xi64, #mma> 2026-02-21T09:53:10.1433001Z %126 = arith.andi %124, %125 : tensor<1x256xi1, #mma> 2026-02-21T09:53:10.1433209Z %127 = tt.broadcast %126 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1433269Z %128 = arith.andi %123, %127 : tensor<128x256xi1, #mma> 2026-02-21T09:53:10.1433337Z tt.store %119, %106, %128 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:53:10.1433380Z } {tt.num_stages = 1 : i32} 2026-02-21T09:53:10.1433417Z tt.return 2026-02-21T09:53:10.1433450Z } 2026-02-21T09:53:10.1433483Z } 2026-02-21T09:53:10.1433488Z 2026-02-21T09:53:10.1433523Z {-# 2026-02-21T09:53:10.1433565Z 
external_resources: { 2026-02-21T09:53:10.1433606Z mlir_reproducer: { 2026-02-21T09:53:10.1434554Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:53:10.1434599Z disable_threading: false, 2026-02-21T09:53:10.1434637Z verify_each: true 2026-02-21T09:53:10.1434674Z } 2026-02-21T09:53:10.1434705Z } 2026-02-21T09:53:10.1434736Z #-} 2026-02-21T09:53:10.1434990Z /tmp/torchinductor_root/tt/cttq32vntkim2n6zppmuy3waxfbua3zj3tkcf4h746awu2lpjb4e.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:53:10.1435403Z /tmp/torchinductor_root/tt/cttq32vntkim2n6zppmuy3waxfbua3zj3tkcf4h746awu2lpjb4e.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:53:10.1435516Z [520s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:53:10.1436152Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:53:10.1436223Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:53:10.1436303Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:53:14.5808140Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:53:14.5810114Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:53:14.5810438Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:53:14.5810789Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:53:14.5811114Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:53:14.5811400Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:53:14.5811672Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:53:14.5811910Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:53:14.5812106Z #smem = #ttg.shared_memory 2026-02-21T09:53:14.5812460Z module attributes {"ttg.num-ctas" = 1 : i32, 
"ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:53:14.5812949Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:53:14.5813359Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:53:14.5813544Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:14.5813739Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:14.5813925Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:53:14.5814093Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:53:14.5814228Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:53:14.5814417Z %cst_3 = arith.constant dense<508> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:14.5814644Z %cst_4 = arith.constant dense<8192> : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5814837Z %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5815017Z %cst_6 = arith.constant dense<512> : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5815193Z %cst_7 = arith.constant dense<0> : tensor<1x128xi64, #blocked1> 2026-02-21T09:53:14.5815385Z %cst_8 = arith.constant dense<8192> : tensor<1x128xi64, #blocked1> 2026-02-21T09:53:14.5815631Z %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:14.5815792Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:53:14.5815919Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:53:14.5816066Z %cst_10 = arith.constant dense<0> : tensor<4x128xi8, #blocked1> 2026-02-21T09:53:14.5816256Z %cst_11 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5816410Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:53:14.5816528Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:53:14.5816714Z %cst_12 = 
arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5817007Z %0 = tt.get_program_id x : i32 2026-02-21T09:53:14.5817136Z %1 = arith.divsi %0, %c128_i32 : i32 2026-02-21T09:53:14.5817253Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:53:14.5817385Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:53:14.5817505Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:53:14.5817621Z %5 = arith.remsi %0, %c128_i32 : i32 2026-02-21T09:53:14.5817740Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:53:14.5817862Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:53:14.5817975Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:53:14.5818103Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:53:14.5818310Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:14.5818611Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:14.5818924Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:14.5819196Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:14.5819456Z %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:14.5819667Z %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:14.5819879Z %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:53:14.5820109Z %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:53:14.5820269Z %18 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:53:14.5820429Z %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:53:14.5820662Z %20 = arith.addi %19, %13 : tensor<128xi32, #ttg.slice<{dim = 
0, parent = #mma}>> 2026-02-21T09:53:14.5820899Z %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:14.5821221Z %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:53:14.5821468Z %23 = arith.muli %22, %cst_9 : tensor<128x1xi32, #blocked2> 2026-02-21T09:53:14.5821678Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:53:14.5821891Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:53:14.5822066Z %26 = arith.extsi %18 : i32 to i64 2026-02-21T09:53:14.5822216Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:53:14.5822448Z %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:14.5822783Z %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:14.5823073Z %30 = tt.splat %26 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:14.5823380Z %31 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:14.5823706Z %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:53:14.5823983Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi64, #blocked1> 2026-02-21T09:53:14.5824273Z %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked1> -> tensor<4x128xi64, #blocked1> 2026-02-21T09:53:14.5824472Z %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x128xi64, #blocked1> 2026-02-21T09:53:14.5824646Z %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x128xi64, #blocked1> 2026-02-21T09:53:14.5824821Z 
%37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked1> 2026-02-21T09:53:14.5825025Z %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked1> -> tensor<4x128xi1, #blocked1> 2026-02-21T09:53:14.5825324Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:53:14.5825745Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:53:14.5826160Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:14.5826425Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:14.5826618Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:53:14.5826829Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:53:14.5827019Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:53:14.5827236Z %46 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:53:14.5827521Z %47 = tt.expand_dims %21 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:53:14.5827788Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:53:14.5827996Z %49 = arith.addi %24, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:53:14.5828189Z %50 = tt.addptr %25, %49 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:53:14.5828406Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:53:14.5828706Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 
2026-02-21T09:53:14.5829079Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:53:14.5829524Z %53:3 = scf.for %arg3 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg4 = %cst_2, %arg5 = %c0_i32, %arg6 = %52) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>) : i32 { 2026-02-21T09:53:14.5829853Z %92 = arith.addi %arg3, %c4_i32 : i32 2026-02-21T09:53:14.5829989Z %93 = arith.muli %92, %c2_i32 : i32 2026-02-21T09:53:14.5830156Z %94 = tt.splat %93 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:14.5830374Z %95 = arith.addi %94, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:53:14.5830660Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:53:14.5830929Z %97 = tt.broadcast %96 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:53:14.5831141Z %98 = arith.addi %24, %97 : tensor<128x8xi32, #blocked2> 2026-02-21T09:53:14.5831350Z %99 = tt.addptr %25, %98 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:53:14.5831566Z %100 = tt.load %99 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:53:14.5831887Z %101 = ttg.local_load %arg6 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:14.5832345Z %102 = arith.extf %101 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:14.5832635Z %103 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:53:14.5832810Z %104 = tt.splat %103 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:14.5833031Z %105 = arith.addi %104, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = 
#blocked1}>> 2026-02-21T09:53:14.5833344Z %106 = tt.expand_dims %105 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5833594Z %107 = arith.muli %106, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5833801Z %108 = tt.broadcast %107 : tensor<4x1xi64, #blocked1> -> tensor<4x128xi64, #blocked1> 2026-02-21T09:53:14.5833998Z %109 = arith.addi %108, %34 : tensor<4x128xi64, #blocked1> 2026-02-21T09:53:14.5834254Z %110 = tt.addptr %27, %109 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi64, #blocked1> 2026-02-21T09:53:14.5834469Z %111 = arith.cmpi sge, %106, %cst_5 : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5834639Z %112 = arith.cmpi slt, %106, %cst_6 : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5834818Z %113 = arith.andi %111, %112 : tensor<4x1xi1, #blocked1> 2026-02-21T09:53:14.5835005Z %114 = tt.broadcast %113 : tensor<4x1xi1, #blocked1> -> tensor<4x128xi1, #blocked1> 2026-02-21T09:53:14.5835209Z %115 = arith.andi %114, %38 : tensor<4x128xi1, #blocked1> 2026-02-21T09:53:14.5835381Z %116 = tt.load %110, %115, %cst_10 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:53:14.5835640Z %117 = ttg.convert_layout %116 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5835941Z %118 = arith.shli %117, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5836182Z %119 = arith.shrsi %118, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5836434Z %120 = arith.shrsi %117, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5836755Z %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:53:14.5837100Z %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<4x1x128xi8, #blocked> 2026-02-21T09:53:14.5837402Z %123 = tt.broadcast %121 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5837644Z %124 = arith.select %43, %123, %cst_11 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5837898Z %125 = tt.broadcast %122 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5838135Z %126 = arith.select %45, %125, %124 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5838378Z %127 = tt.reshape %126 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:53:14.5838603Z %128 = arith.sitofp %127 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:53:14.5838869Z %129 = ttg.local_alloc %128 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:53:14.5839198Z %130 = ttg.local_load %129 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:14.5839687Z %131 = tt.dot %102, %130, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:53:14.5840073Z %132 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:53:14.5840200Z %133 = arith.cmpi slt, %132, %c1_i32 : i32 2026-02-21T09:53:14.5840334Z %134 = arith.select %133, %132, %c0_i32 : i32 2026-02-21T09:53:14.5840613Z %135 = ttg.memdesc_index %46[%134] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:53:14.5840990Z ttg.local_store %100, %135 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:53:14.5841307Z scf.yield %131, %134, %135 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 
1x128x8> 2026-02-21T09:53:14.5841587Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32} 2026-02-21T09:53:14.5841892Z %54 = ttg.local_load %53#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:14.5842352Z %55 = arith.extf %54 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:14.5842764Z %56 = arith.addi %29, %cst_3 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:53:14.5843045Z %57 = tt.expand_dims %56 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5843302Z %58 = arith.muli %57, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5843488Z %59 = tt.broadcast %58 : tensor<4x1xi64, #blocked1> -> tensor<4x128xi64, #blocked1> 2026-02-21T09:53:14.5843688Z %60 = arith.addi %59, %34 : tensor<4x128xi64, #blocked1> 2026-02-21T09:53:14.5843891Z %61 = tt.addptr %27, %60 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi64, #blocked1> 2026-02-21T09:53:14.5844096Z %62 = arith.cmpi sge, %57, %cst_5 : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5844282Z %63 = arith.cmpi slt, %57, %cst_6 : tensor<4x1xi64, #blocked1> 2026-02-21T09:53:14.5844439Z %64 = arith.andi %62, %63 : tensor<4x1xi1, #blocked1> 2026-02-21T09:53:14.5844617Z %65 = tt.broadcast %64 : tensor<4x1xi1, #blocked1> -> tensor<4x128xi1, #blocked1> 2026-02-21T09:53:14.5844816Z %66 = arith.andi %65, %38 : tensor<4x128xi1, #blocked1> 2026-02-21T09:53:14.5844978Z %67 = tt.load %61, %66, %cst_10 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:53:14.5845261Z %68 = ttg.convert_layout %67 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5845542Z %69 = arith.shli %68, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:53:14.5845780Z %70 = arith.shrsi %69, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5846012Z %71 = arith.shrsi %68, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:53:14.5846302Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:53:14.5846637Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:53:14.5846926Z %74 = tt.broadcast %72 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5847161Z %75 = arith.select %43, %74, %cst_11 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5847409Z %76 = tt.broadcast %73 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5847640Z %77 = arith.select %45, %76, %75 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:53:14.5847861Z %78 = tt.reshape %77 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:53:14.5848109Z %79 = arith.sitofp %78 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:53:14.5848354Z %80 = ttg.local_alloc %79 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:53:14.5848689Z %81 = ttg.local_load %80 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:53:14.5849170Z %82 = tt.dot %55, %81, %53#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:53:14.5849584Z ttg.local_dealloc %46 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:53:14.5849798Z %83 = arith.truncf %82 : 
tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:53:14.5850074Z %84 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:53:14.5850315Z %85 = arith.muli %84, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:53:14.5850573Z %86 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:53:14.5850826Z %87 = tt.broadcast %85 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:53:14.5851038Z %88 = tt.broadcast %86 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:53:14.5851213Z %89 = arith.addi %87, %88 : tensor<128x128xi32, #mma> 2026-02-21T09:53:14.5851386Z %90 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:53:14.5851617Z %91 = tt.addptr %90, %89 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:53:14.5851810Z tt.store %91, %83 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:53:14.5851943Z tt.return 2026-02-21T09:53:14.5852022Z } 2026-02-21T09:53:14.5852111Z } 2026-02-21T09:53:14.5852156Z 2026-02-21T09:53:14.5852187Z {-# 2026-02-21T09:53:14.5852269Z external_resources: { 2026-02-21T09:53:14.5852366Z mlir_reproducer: { 2026-02-21T09:53:14.5853408Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:53:14.5854438Z disable_threading: false, 
2026-02-21T09:53:14.5854545Z verify_each: true
2026-02-21T09:53:14.5854636Z }
2026-02-21T09:53:14.5854720Z }
2026-02-21T09:53:14.5854791Z #-}
2026-02-21T09:53:14.5855067Z /tmp/torchinductor_root/ge/cgeahcd57aauizttec2barrz3cst5noydv6rsrfcsbxtir4b7wj6.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:53:14.5855771Z /tmp/torchinductor_root/ge/cgeahcd57aauizttec2barrz3cst5noydv6rsrfcsbxtir4b7wj6.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:53:14.5856339Z [525s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:53:14.5857081Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:53:14.5857769Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:53:14.5857950Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:53:15.1467403Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 82/82 9.0 configs/s
2026-02-21T09:53:20.1346316Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 196/196 25.2 configs/s
2026-02-21T09:53:23.5344696Z [534s] Generation 7 complete:
2026-02-21T09:53:23.5345145Z error=13
2026-02-21T09:53:23.5345398Z ok=72
2026-02-21T09:53:23.5345606Z min=1.0151
2026-02-21T09:53:23.5345807Z mid=1.6335
2026-02-21T09:53:23.5346540Z max=59.9127
2026-02-21T09:53:23.5346770Z best={'block_sizes': [8, 128, 256],
2026-02-21T09:53:23.5347152Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T09:53:23.5347512Z 'l2_groupings': [8],
2026-02-21T09:53:23.5347786Z 'load_eviction_policies': ['', ''],
2026-02-21T09:53:23.5348108Z 'loop_orders': [[0, 1]],
2026-02-21T09:53:23.5348384Z 'matrix_instr_nonkdim': 32,
2026-02-21T09:53:23.5348664Z 'num_sm_multiplier': 64,
2026-02-21T09:53:23.5348918Z 'num_stages': 4,
2026-02-21T09:53:23.5349150Z 'num_warps': 4,
2026-02-21T09:53:23.5349538Z 'pid_type': 'persistent_interleaved',
2026-02-21T09:53:23.5349861Z 'range_flattens': [False, True],
2026-02-21T09:53:23.5350157Z 'range_multi_buffers': [True, None],
2026-02-21T09:53:23.5350460Z 'range_num_stages': [2, 3],
2026-02-21T09:53:23.5350737Z 'range_unroll_factors': [4, 0],
2026-02-21T09:53:23.5351035Z 'range_warp_specializes': [],
2026-02-21T09:53:23.5351308Z 'waves_per_eu': 2}
2026-02-21T09:53:23.5423397Z [534s] Fitting surrogate: 760 points, 760 targets
2026-02-21T09:53:24.4541320Z [535s] Generation 8 starting: 80 neighbors, 4 active search path(s)
2026-02-21T09:54:01.5870815Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.2 configs/s
2026-02-21T09:54:03.9142341Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:03.9154732Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T09:54:03.9155662Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:54:03.9156865Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:54:03.9157804Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T09:54:03.9158689Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:54:03.9159451Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:03.9160099Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:03.9160570Z #smem = #ttg.shared_memory 2026-02-21T09:54:03.9161184Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:54:03.9162476Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:54:03.9163602Z %cst = arith.constant dense<8192> : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9164046Z %cst_0 = arith.constant dense<0> : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9164491Z %cst_1 = arith.constant dense<16384> : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9164931Z %cst_2 = arith.constant dense<0> : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9165363Z %cst_3 = arith.constant dense<8192> : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9165919Z %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:03.9166373Z %cst_5 = 
arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:03.9166851Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9167273Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:54:03.9176453Z %cst_7 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9176977Z %cst_8 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9177486Z %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9178031Z %cst_10 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9178458Z %cst_11 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9178761Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:54:03.9178987Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:54:03.9179222Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:54:03.9179447Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:54:03.9179772Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:54:03.9179962Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:54:03.9180138Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:54:03.9180372Z %cst_12 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9180614Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:54:03.9180792Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:54:03.9181084Z %cst_13 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9181404Z %0 = tt.get_program_id x : i32 2026-02-21T09:54:03.9181581Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:54:03.9181771Z %2 = arith.minsi %1, %c4096_i32 : i32 2026-02-21T09:54:03.9182103Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9182574Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9183031Z %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9183378Z %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9183858Z %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9184309Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9184710Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9185031Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9185480Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:54:03.9186174Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:54:03.9186851Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:03.9187267Z %14 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:03.9187594Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:54:03.9187910Z %16 = arith.cmpi eq, %13, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:03.9188217Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:54:03.9188581Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:54:03.9189018Z %19 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, 
#ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9189581Z %20 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9189952Z %21 = arith.subi %2, %0 : i32 2026-02-21T09:54:03.9190122Z %22 = arith.remsi %21, %c4_i32 : i32 2026-02-21T09:54:03.9190308Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:54:03.9190476Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:54:03.9190695Z scf.for %arg3 = %0 to %24 step %c4_i32 : i32 { 2026-02-21T09:54:03.9190907Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:54:03.9191102Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:54:03.9191281Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:54:03.9191471Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:54:03.9191659Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:54:03.9191850Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:54:03.9192031Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:54:03.9192222Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:54:03.9192402Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:54:03.9192671Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9193036Z %35 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9193309Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:54:03.9193574Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9193931Z %38 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9194381Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9194795Z %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9195117Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, 
#blocked2> 2026-02-21T09:54:03.9195566Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:54:03.9196038Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9196390Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9196832Z %45 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9197288Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9197599Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9197918Z %48 = tt.addptr %9, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9198241Z %49 = tt.load %48 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9198706Z %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9199306Z ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9199747Z %51 = arith.addi %8, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9200207Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9200661Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9200986Z %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9201313Z %55 = tt.addptr %9, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9201561Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9201841Z %57 = 
ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9202197Z ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9202890Z %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:54:03.9203368Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9203596Z %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9203793Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:03.9203914Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:54:03.9204087Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9204309Z %414 = arith.addi %413, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9204580Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9204863Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9205057Z %417 = arith.addi %41, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9205301Z %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9205555Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9205856Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:54:03.9206316Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9206700Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9206948Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9207142Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9207335Z %425 = arith.addi %424, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9207537Z %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9207740Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9207985Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9208274Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9208508Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9208748Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9209038Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9209406Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9209692Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9209933Z %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9210175Z %436 = 
tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9210411Z %437 = arith.select %17, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9210662Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9210887Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9211142Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9211474Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9211972Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9212321Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:03.9212452Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:54:03.9212588Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:54:03.9212857Z %446 = ttg.memdesc_index %44[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9213217Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9213621Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9213929Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:54:03.9214103Z %59 = arith.addi %7, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9214447Z %60 = ttg.local_load 
%58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9214875Z %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9215247Z %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9215489Z %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9215684Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9215876Z %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9216069Z %66 = tt.addptr %10, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9216266Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9216509Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9216785Z %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9217014Z %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9217260Z %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9217541Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9217871Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9218148Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9218385Z %75 = arith.select %15, 
%74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9218640Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9218867Z %77 = arith.select %17, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9219089Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9219308Z %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9219569Z %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9219890Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9220351Z %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9220739Z %83 = arith.addi %7, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9221065Z %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9221489Z %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9221863Z %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9222104Z %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9222308Z %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9222500Z %89 = 
arith.addi %88, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9222690Z %90 = tt.addptr %10, %89 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9222885Z %91 = tt.load %90 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9223122Z %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9223399Z %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9223633Z %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9223861Z %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9224144Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9224473Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9224749Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9224984Z %99 = arith.select %15, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9225232Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9225464Z %101 = arith.select %17, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9225697Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9225919Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9226173Z %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9226497Z %105 = ttg.local_load %104 : 
!ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9226980Z %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9227364Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9227594Z %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:54:03.9227770Z %108 = arith.extsi %33 : i32 to i64 2026-02-21T09:54:03.9227887Z %109 = arith.extsi %36 : i32 to i64 2026-02-21T09:54:03.9228051Z %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9228265Z %111 = arith.addi %110, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9228530Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9228770Z %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9228948Z %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9229157Z %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9229373Z %116 = arith.addi %115, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9229635Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9229899Z %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9230103Z %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9230292Z %120 = tt.addptr %18, %119 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 
2026-02-21T09:54:03.9230497Z %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9230661Z %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9230818Z %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma> 2026-02-21T09:54:03.9230991Z %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9231178Z %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9231343Z %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9231497Z %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma> 2026-02-21T09:54:03.9231672Z %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9231852Z %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9232016Z tt.store %120, %107, %129 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:54:03.9232166Z %130 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:54:03.9232292Z %131 = arith.divsi %130, %c256_i32 : i32 2026-02-21T09:54:03.9232415Z %132 = arith.muli %131, %c8_i32 : i32 2026-02-21T09:54:03.9232533Z %133 = arith.subi %c128_i32, %132 : i32 2026-02-21T09:54:03.9232667Z %134 = arith.minsi %133, %c8_i32 : i32 2026-02-21T09:54:03.9232785Z %135 = arith.remsi %130, %c256_i32 : i32 2026-02-21T09:54:03.9232903Z %136 = arith.remsi %135, %134 : i32 2026-02-21T09:54:03.9233017Z %137 = arith.addi %132, %136 : i32 2026-02-21T09:54:03.9233133Z %138 = arith.divsi %135, %134 : i32 2026-02-21T09:54:03.9233249Z %139 = arith.muli %137, %c128_i32 : i32 2026-02-21T09:54:03.9233422Z %140 = tt.splat %139 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9233649Z %141 = arith.addi %140, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9233823Z %142 = arith.muli %138, %c256_i32 : i32 2026-02-21T09:54:03.9234008Z %143 = tt.splat %142 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = 
#blocked1}>> 2026-02-21T09:54:03.9234229Z %144 = arith.addi %143, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9234508Z %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9234764Z %146 = arith.muli %145, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9234975Z %147 = tt.broadcast %146 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9235257Z %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:54:03.9235538Z %149 = tt.broadcast %148 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9235763Z %150 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9235953Z %151 = arith.addi %147, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9236151Z %152 = tt.addptr %9, %151 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9236360Z %153 = tt.load %152 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9236649Z %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9237017Z ttg.local_store %153, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9237261Z %155 = arith.addi %147, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9237480Z %156 = tt.addptr %9, %155 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9237687Z %157 = tt.load %156 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9237967Z %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 
2x128x4> 2026-02-21T09:54:03.9238329Z ttg.local_store %157, %158 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9238859Z %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:54:03.9239425Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9239656Z %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9239835Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:03.9239958Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:54:03.9240129Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9240351Z %414 = arith.addi %413, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9240645Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9240924Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9241119Z %417 = arith.addi %147, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9241325Z %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9241532Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9241835Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9246196Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9246585Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9246865Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9247059Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9247259Z %425 = arith.addi %424, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9247462Z %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9247664Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9247912Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9248197Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9248437Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9248675Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9248965Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9249326Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9249613Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9249862Z %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9250105Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9250343Z %437 = arith.select %17, %436, %435 : 
tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9250577Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9250836Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9251142Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9251471Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9251946Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9252296Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:03.9252441Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:54:03.9252575Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:54:03.9252848Z %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9253207Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9253613Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9253936Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:54:03.9254212Z %160 = ttg.local_load %159#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9254647Z %161 = arith.extf %160 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth 
= 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9254956Z %162 = arith.addi %64, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9255159Z %163 = tt.addptr %10, %162 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9255362Z %164 = tt.load %163 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9255605Z %165 = ttg.convert_layout %164 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9255889Z %166 = arith.shli %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9256126Z %167 = arith.shrsi %166, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9256362Z %168 = arith.shrsi %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9256655Z %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9256991Z %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9257277Z %171 = tt.broadcast %169 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9257535Z %172 = arith.select %15, %171, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9257776Z %173 = tt.broadcast %170 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9258014Z %174 = arith.select %17, %173, %172 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9258244Z %175 = tt.reshape %174 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9258471Z %176 = arith.sitofp %175 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9258724Z %177 = ttg.local_alloc 
%176 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9259059Z %178 = ttg.local_load %177 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9259561Z %179 = tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9260058Z %180 = ttg.local_load %159#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9260489Z %181 = arith.extf %180 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9260801Z %182 = arith.addi %88, %149 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9261002Z %183 = tt.addptr %10, %182 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9261206Z %184 = tt.load %183 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9261449Z %185 = ttg.convert_layout %184 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9261737Z %186 = arith.shli %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9261976Z %187 = arith.shrsi %186, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9262228Z %188 = arith.shrsi %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9262517Z %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9262851Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 
1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9263150Z %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9263393Z %192 = arith.select %15, %191, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9263632Z %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9263863Z %194 = arith.select %17, %193, %192 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9264098Z %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9264324Z %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9264598Z %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9264922Z %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9265392Z %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9265794Z ttg.local_dealloc %150 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9266009Z %200 = arith.truncf %199 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:54:03.9266185Z %201 = arith.extsi %139 : i32 to i64 2026-02-21T09:54:03.9266302Z %202 = arith.extsi %142 : i32 to i64 2026-02-21T09:54:03.9266478Z %203 = tt.splat %201 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9266694Z %204 = arith.addi %203, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9266963Z %205 = tt.expand_dims %204 {axis = 1 : i32} : 
tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9267204Z %206 = arith.muli %205, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9267383Z %207 = tt.broadcast %206 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9267595Z %208 = tt.splat %202 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9267801Z %209 = arith.addi %208, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9268067Z %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9268330Z %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9268526Z %212 = arith.addi %207, %211 : tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9268720Z %213 = tt.addptr %18, %212 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9268923Z %214 = arith.cmpi sge, %205, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9269091Z %215 = arith.cmpi slt, %205, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9269248Z %216 = arith.andi %214, %215 : tensor<128x1xi1, #mma> 2026-02-21T09:54:03.9269424Z %217 = tt.broadcast %216 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9269609Z %218 = arith.cmpi sge, %210, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9269798Z %219 = arith.cmpi slt, %210, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9269960Z %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma> 2026-02-21T09:54:03.9270133Z %221 = tt.broadcast %220 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9270320Z %222 = arith.andi %217, %221 : tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9270482Z tt.store %213, %200, %222 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:54:03.9270629Z %223 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:54:03.9270786Z %224 = arith.divsi 
%223, %c256_i32 : i32 2026-02-21T09:54:03.9270906Z %225 = arith.muli %224, %c8_i32 : i32 2026-02-21T09:54:03.9271030Z %226 = arith.subi %c128_i32, %225 : i32 2026-02-21T09:54:03.9271147Z %227 = arith.minsi %226, %c8_i32 : i32 2026-02-21T09:54:03.9271276Z %228 = arith.remsi %223, %c256_i32 : i32 2026-02-21T09:54:03.9271416Z %229 = arith.remsi %228, %227 : i32 2026-02-21T09:54:03.9271532Z %230 = arith.addi %225, %229 : i32 2026-02-21T09:54:03.9271648Z %231 = arith.divsi %228, %227 : i32 2026-02-21T09:54:03.9271778Z %232 = arith.muli %230, %c128_i32 : i32 2026-02-21T09:54:03.9271966Z %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9272193Z %234 = arith.addi %233, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9272368Z %235 = arith.muli %231, %c256_i32 : i32 2026-02-21T09:54:03.9272539Z %236 = tt.splat %235 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9272758Z %237 = arith.addi %236, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9273055Z %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9273309Z %239 = arith.muli %238, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9273509Z %240 = tt.broadcast %239 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9273793Z %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:54:03.9274073Z %242 = tt.broadcast %241 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9274296Z %243 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9274486Z %244 = arith.addi %240, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9274686Z %245 = 
tt.addptr %9, %244 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9274893Z %246 = tt.load %245 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9275179Z %247 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9275543Z ttg.local_store %246, %247 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9275786Z %248 = arith.addi %240, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9276000Z %249 = tt.addptr %9, %248 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9276204Z %250 = tt.load %249 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9276484Z %251 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9276844Z ttg.local_store %250, %251 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9277370Z %252:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %247, %arg8 = %251) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:54:03.9277863Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9278093Z %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9278268Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:03.9278392Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:54:03.9278574Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9278795Z %414 = arith.addi %413, %8 : 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9279071Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9279349Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9279568Z %417 = arith.addi %240, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9279827Z %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9280036Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9280341Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9280775Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9281174Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9281425Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9281616Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9281813Z %425 = arith.addi %424, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9282012Z %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9282290Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9282545Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9282883Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9283126Z %430 = 
arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9283360Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9283653Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9283993Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9284295Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9284541Z %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9284780Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9285021Z %437 = arith.select %17, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9285256Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9285500Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9285754Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9286081Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9286606Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9286957Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:03.9287147Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:54:03.9287283Z 
%445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:54:03.9287551Z %446 = ttg.memdesc_index %243[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9287916Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9288317Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9288622Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:54:03.9288913Z %253 = ttg.local_load %252#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9289366Z %254 = arith.extf %253 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9289665Z %255 = arith.addi %64, %242 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9289865Z %256 = tt.addptr %10, %255 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9290064Z %257 = tt.load %256 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9290311Z %258 = ttg.convert_layout %257 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9290593Z %259 = arith.shli %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9290830Z %260 = arith.shrsi %259, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9291069Z %261 = arith.shrsi %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9291360Z %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9291706Z %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9291994Z %264 = tt.broadcast %262 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9292299Z %265 = arith.select %15, %264, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9292544Z %266 = tt.broadcast %263 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9292787Z %267 = arith.select %17, %266, %265 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9293025Z %268 = tt.reshape %267 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9293255Z %269 = arith.sitofp %268 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9293510Z %270 = ttg.local_alloc %269 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9294027Z %271 = ttg.local_load %270 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9294513Z %272 = tt.dot %254, %271, %252#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9295026Z %273 = ttg.local_load %252#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9295462Z %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9295769Z %275 = arith.addi %88, %242 : tensor<2x256xi32, #blocked1> 
2026-02-21T09:54:03.9295970Z %276 = tt.addptr %10, %275 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9296214Z %277 = tt.load %276 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9296463Z %278 = ttg.convert_layout %277 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9296804Z %279 = arith.shli %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9297074Z %280 = arith.shrsi %279, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9297320Z %281 = arith.shrsi %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9297618Z %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9297975Z %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9298266Z %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9298512Z %285 = arith.select %15, %284, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9298753Z %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9298995Z %287 = arith.select %17, %286, %285 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9299230Z %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9299460Z %289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9299719Z %290 = ttg.local_alloc %289 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9300068Z %291 = ttg.local_load %290 : !ttg.memdesc<4x256xf32, 
#shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9300611Z %292 = tt.dot %274, %291, %272, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9301017Z ttg.local_dealloc %243 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9301245Z %293 = arith.truncf %292 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:54:03.9301426Z %294 = arith.extsi %232 : i32 to i64 2026-02-21T09:54:03.9301548Z %295 = arith.extsi %235 : i32 to i64 2026-02-21T09:54:03.9301747Z %296 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9301979Z %297 = arith.addi %296, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9302257Z %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9302536Z %299 = arith.muli %298, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9302719Z %300 = tt.broadcast %299 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9302986Z %301 = tt.splat %295 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9303197Z %302 = arith.addi %301, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9303482Z %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9303749Z %304 = tt.broadcast %303 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9303935Z %305 = arith.addi %300, %304 : tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9304133Z %306 = tt.addptr %18, %305 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9304340Z 
%307 = arith.cmpi sge, %298, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9304511Z %308 = arith.cmpi slt, %298, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9304671Z %309 = arith.andi %307, %308 : tensor<128x1xi1, #mma> 2026-02-21T09:54:03.9304854Z %310 = tt.broadcast %309 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9305096Z %311 = arith.cmpi sge, %303, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9305262Z %312 = arith.cmpi slt, %303, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9305437Z %313 = arith.andi %311, %312 : tensor<1x256xi1, #mma> 2026-02-21T09:54:03.9305658Z %314 = tt.broadcast %313 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9305843Z %315 = arith.andi %310, %314 : tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9306028Z tt.store %306, %293, %315 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:54:03.9306223Z %316 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:54:03.9306353Z %317 = arith.divsi %316, %c256_i32 : i32 2026-02-21T09:54:03.9306477Z %318 = arith.muli %317, %c8_i32 : i32 2026-02-21T09:54:03.9306607Z %319 = arith.subi %c128_i32, %318 : i32 2026-02-21T09:54:03.9306727Z %320 = arith.minsi %319, %c8_i32 : i32 2026-02-21T09:54:03.9306853Z %321 = arith.remsi %316, %c256_i32 : i32 2026-02-21T09:54:03.9306974Z %322 = arith.remsi %321, %320 : i32 2026-02-21T09:54:03.9307098Z %323 = arith.addi %318, %322 : i32 2026-02-21T09:54:03.9307220Z %324 = arith.divsi %321, %320 : i32 2026-02-21T09:54:03.9307359Z %325 = arith.muli %323, %c128_i32 : i32 2026-02-21T09:54:03.9307537Z %326 = tt.splat %325 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9307767Z %327 = arith.addi %326, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9307948Z %328 = arith.muli %324, %c256_i32 : i32 2026-02-21T09:54:03.9308188Z %329 = tt.splat %328 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:54:03.9308426Z %330 = arith.addi %329, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9308723Z %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9308995Z %332 = arith.muli %331, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9309207Z %333 = tt.broadcast %332 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9309489Z %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:54:03.9309776Z %335 = tt.broadcast %334 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9310003Z %336 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9310276Z %337 = arith.addi %333, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9310487Z %338 = tt.addptr %9, %337 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9310768Z %339 = tt.load %338 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9311060Z %340 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9311477Z ttg.local_store %339, %340 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9311723Z %341 = arith.addi %333, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9311932Z %342 = tt.addptr %9, %341 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9312141Z %343 = tt.load %342 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9312455Z %344 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 
2026-02-21T09:54:03.9312829Z ttg.local_store %343, %344 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9313356Z %345:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %340, %arg8 = %344) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:54:03.9313875Z %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9314127Z %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9314306Z %411 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:03.9314437Z %412 = arith.muli %411, %c2_i32 : i32 2026-02-21T09:54:03.9314609Z %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9314834Z %414 = arith.addi %413, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9315114Z %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9315443Z %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9315652Z %417 = arith.addi %333, %416 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9315879Z %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9316091Z %419 = tt.load %418 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9316395Z %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9316838Z %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx 
= 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9317253Z %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9317501Z %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9317701Z %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9317905Z %425 = arith.addi %424, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9318109Z %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9318330Z %427 = tt.load %426 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9318593Z %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9318883Z %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9319129Z %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9319374Z %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9319687Z %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9320025Z %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9320318Z %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9320570Z %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9320813Z %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9321057Z %437 = arith.select %17, %436, %435 : tensor<2x2x256xi1, 
#blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9321293Z %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9321528Z %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9321798Z %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9322161Z %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9322692Z %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9323049Z %443 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:03.9323181Z %444 = arith.cmpi slt, %443, %c2_i32 : i32 2026-02-21T09:54:03.9323321Z %445 = arith.select %444, %443, %c0_i32 : i32 2026-02-21T09:54:03.9323592Z %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9323958Z ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9324364Z scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9324673Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:54:03.9324957Z %346 = ttg.local_load %345#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9325390Z %347 = arith.extf %346 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9325711Z %348 = arith.addi %64, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9325919Z %349 = tt.addptr %10, %348 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9326121Z %350 = tt.load %349 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9326374Z %351 = ttg.convert_layout %350 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9326657Z %352 = arith.shli %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9326921Z %353 = arith.shrsi %352, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9327166Z %354 = arith.shrsi %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9327463Z %355 = tt.expand_dims %353 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9327824Z %356 = tt.expand_dims %354 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9328113Z %357 = tt.broadcast %355 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9328363Z %358 = arith.select %15, %357, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9328611Z %359 = tt.broadcast %356 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9328852Z %360 = arith.select %17, %359, %358 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9329091Z %361 = tt.reshape %360 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9329319Z %362 = arith.sitofp %361 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9329581Z %363 = ttg.local_alloc %362 : 
(tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9329914Z %364 = ttg.local_load %363 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9330408Z %365 = tt.dot %347, %364, %345#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9330907Z %366 = ttg.local_load %345#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9331342Z %367 = arith.extf %366 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9331640Z %368 = arith.addi %88, %335 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9331846Z %369 = tt.addptr %10, %368 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9332047Z %370 = tt.load %369 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9332294Z %371 = ttg.convert_layout %370 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9332584Z %372 = arith.shli %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9332822Z %373 = arith.shrsi %372, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9333064Z %374 = arith.shrsi %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9333352Z %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9333708Z %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9333997Z %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9334235Z %378 = arith.select %15, %377, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9334475Z %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9334708Z %380 = arith.select %17, %379, %378 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9334955Z %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9335179Z %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9335433Z %383 = ttg.local_alloc %382 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9335774Z %384 = ttg.local_load %383 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9336241Z %385 = tt.dot %367, %384, %365, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9336628Z ttg.local_dealloc %336 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9336847Z %386 = arith.truncf %385 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:54:03.9337021Z %387 = arith.extsi %325 : i32 to i64 2026-02-21T09:54:03.9337141Z %388 = arith.extsi %328 : i32 to i64 2026-02-21T09:54:03.9337308Z %389 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9337520Z %390 = arith.addi %389, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9337788Z %391 = tt.expand_dims %390 {axis = 1 : i32} : 
tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9338028Z %392 = arith.muli %391, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9338210Z %393 = tt.broadcast %392 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9338434Z %394 = tt.splat %388 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9338643Z %395 = arith.addi %394, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9338908Z %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9339168Z %397 = tt.broadcast %396 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9339355Z %398 = arith.addi %393, %397 : tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9339545Z %399 = tt.addptr %18, %398 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9339750Z %400 = arith.cmpi sge, %391, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9339917Z %401 = arith.cmpi slt, %391, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9340072Z %402 = arith.andi %400, %401 : tensor<128x1xi1, #mma> 2026-02-21T09:54:03.9340251Z %403 = tt.broadcast %402 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9340433Z %404 = arith.cmpi sge, %396, %cst_0 : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9340599Z %405 = arith.cmpi slt, %396, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9340755Z %406 = arith.andi %404, %405 : tensor<1x256xi1, #mma> 2026-02-21T09:54:03.9340926Z %407 = tt.broadcast %406 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9341121Z %408 = arith.andi %403, %407 : tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9341281Z tt.store %399, %386, %408 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:54:03.9341429Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:03.9341550Z scf.for %arg3 = %24 to %2 step 
%c1_i32 : i32 { 2026-02-21T09:54:03.9341685Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:54:03.9341804Z %26 = arith.muli %25, %c8_i32 : i32 2026-02-21T09:54:03.9341922Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:54:03.9342042Z %28 = arith.minsi %27, %c8_i32 : i32 2026-02-21T09:54:03.9342159Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:54:03.9342279Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:54:03.9342409Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:54:03.9342524Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:54:03.9342635Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:54:03.9342803Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9343028Z %35 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:03.9343196Z %36 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:54:03.9343373Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9343588Z %38 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:03.9343864Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9344113Z %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:03.9344307Z %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9344582Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:54:03.9344857Z %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9345076Z %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:03.9345342Z %45 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 
0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9345609Z %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9345814Z %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9346006Z %48 = tt.addptr %9, %47 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9346208Z %49 = tt.load %48 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9346487Z %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9346848Z ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9347121Z %51 = arith.addi %8, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9347391Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9347659Z %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9347850Z %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9348041Z %55 = tt.addptr %9, %54 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9348244Z %56 = tt.load %55 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9348520Z %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9348890Z ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9349408Z %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, 
!ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:54:03.9349879Z %130 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9350107Z %131 = arith.addi %130, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9350301Z %132 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:03.9350425Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:54:03.9350594Z %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9350816Z %135 = arith.addi %134, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:03.9351105Z %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:03.9351382Z %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9351577Z %138 = arith.addi %41, %137 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9351780Z %139 = tt.addptr %9, %138 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:03.9351990Z %140 = tt.load %139 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:03.9352293Z %141 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9352728Z %142 = arith.extf %141 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9353112Z %143 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9353365Z %144 = arith.muli %143, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9353557Z %145 = tt.broadcast %144 : 
tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9353768Z %146 = arith.addi %145, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9353967Z %147 = tt.addptr %10, %146 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9354172Z %148 = tt.load %147 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9354417Z %149 = ttg.convert_layout %148 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9354697Z %150 = arith.shli %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9354935Z %151 = arith.shrsi %150, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9355171Z %152 = arith.shrsi %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9355463Z %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9355802Z %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9356086Z %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9356332Z %156 = arith.select %15, %155, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9356590Z %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9356825Z %158 = arith.select %17, %157, %156 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9357061Z %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9357284Z %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9357540Z %161 = ttg.local_alloc %160 : 
(tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9357868Z %162 = ttg.local_load %161 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9358364Z %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9358713Z %164 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:03.9358840Z %165 = arith.cmpi slt, %164, %c2_i32 : i32 2026-02-21T09:54:03.9358997Z %166 = arith.select %165, %164, %c0_i32 : i32 2026-02-21T09:54:03.9359265Z %167 = ttg.memdesc_index %44[%166] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9359628Z ttg.local_store %140, %167 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9360031Z scf.yield %163, %166, %arg8, %167 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:03.9360335Z } {tt.flatten, tt.num_stages = 3 : i32} 2026-02-21T09:54:03.9360509Z %59 = arith.addi %7, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9360837Z %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9361261Z %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9361660Z %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, 
#blocked1> 2026-02-21T09:54:03.9361903Z %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9362092Z %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9362282Z %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9362472Z %66 = tt.addptr %10, %65 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9362737Z %67 = tt.load %66 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9362978Z %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9363254Z %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9363488Z %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9363723Z %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9364007Z %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9364339Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9364634Z %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9364873Z %75 = arith.select %15, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9365105Z %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9365330Z %77 = arith.select %17, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9365553Z %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9365769Z %79 = arith.sitofp %78 : 
tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9366037Z %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9366355Z %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9366851Z %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9367240Z %83 = arith.addi %7, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:03.9367566Z %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9367991Z %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9368366Z %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9368613Z %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1> 2026-02-21T09:54:03.9368803Z %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9368989Z %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9369185Z %90 = tt.addptr %10, %89 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:54:03.9369379Z %91 = tt.load %90 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:54:03.9369638Z %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9369915Z %93 = arith.shli %92, %cst_13 : 
tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9370145Z %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9370377Z %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:03.9370659Z %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9370992Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:54:03.9371273Z %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9371506Z %99 = arith.select %15, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9371742Z %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9371975Z %101 = arith.select %17, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:54:03.9372204Z %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:54:03.9372432Z %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:54:03.9372698Z %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:54:03.9373027Z %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:03.9373495Z %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:54:03.9373873Z ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, 
#smem, mutable> 2026-02-21T09:54:03.9374109Z %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:54:03.9374279Z %108 = arith.extsi %33 : i32 to i64 2026-02-21T09:54:03.9374399Z %109 = arith.extsi %36 : i32 to i64 2026-02-21T09:54:03.9374564Z %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9374775Z %111 = arith.addi %110, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:03.9375060Z %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9375298Z %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9375478Z %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9375688Z %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9375896Z %116 = arith.addi %115, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:03.9376166Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9376426Z %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9376612Z %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9376804Z %120 = tt.addptr %18, %119 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi64, #mma> 2026-02-21T09:54:03.9377008Z %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9377175Z %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma> 2026-02-21T09:54:03.9377330Z %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma> 2026-02-21T09:54:03.9377521Z %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9377707Z %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma> 
2026-02-21T09:54:03.9377871Z %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma> 2026-02-21T09:54:03.9378026Z %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma> 2026-02-21T09:54:03.9378197Z %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9378379Z %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma> 2026-02-21T09:54:03.9378539Z tt.store %120, %107, %129 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:54:03.9378687Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:03.9378791Z tt.return 2026-02-21T09:54:03.9378872Z } 2026-02-21T09:54:03.9378947Z } 2026-02-21T09:54:03.9378994Z 2026-02-21T09:54:03.9379025Z {-# 2026-02-21T09:54:03.9379108Z external_resources: { 2026-02-21T09:54:03.9379206Z mlir_reproducer: { 2026-02-21T09:54:03.9380209Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:54:03.9381211Z disable_threading: false, 2026-02-21T09:54:03.9381316Z verify_each: true 2026-02-21T09:54:03.9381407Z } 2026-02-21T09:54:03.9381478Z } 2026-02-21T09:54:03.9381549Z #-} 2026-02-21T09:54:03.9381822Z /tmp/torchinductor_root/s4/cs4axgay44a357gkb6qe3krrz4yax7524ixw3db4slq63gerq545.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:54:03.9382508Z /tmp/torchinductor_root/s4/cs4axgay44a357gkb6qe3krrz4yax7524ixw3db4slq63gerq545.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 
'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:54:03.9383063Z [574s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:54:03.9383860Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:54:03.9384562Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:54:03.9384732Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:54:05.4115233Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:54:05.4124955Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:54:05.4126212Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T09:54:05.4127371Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:54:05.4128543Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T09:54:05.4129938Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:54:05.4130763Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:05.4131455Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:05.4131809Z #smem = #ttg.shared_memory 2026-02-21T09:54:05.4132541Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:54:05.4134091Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:54:05.4135347Z %cst = arith.constant dense<4> : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4135809Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:54:05.4136168Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:54:05.4136524Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:54:05.4136987Z %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4137547Z %cst_1 = arith.constant dense<0> : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4138003Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:54:05.4138370Z %c8192_i32 = arith.constant 8192 : 
i32 2026-02-21T09:54:05.4138740Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:54:05.4139206Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:54:05.4139412Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:54:05.4139575Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:54:05.4139778Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4140035Z %cst_3 = arith.constant dense<8192> : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4140284Z %cst_4 = arith.constant dense<0> : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4140639Z %cst_5 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4141038Z %cst_6 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4141541Z %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4142028Z %cst_8 = arith.constant dense<2> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4142619Z %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4143041Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:54:05.4143399Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4143901Z %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:54:05.4144312Z %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:54:05.4144705Z %cst_13 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4145039Z %0 = tt.get_program_id x : i32 2026-02-21T09:54:05.4145289Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:54:05.4145543Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:54:05.4146011Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4146667Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:54:05.4147296Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4147928Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4148555Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4149122Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4149593Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4150125Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4150719Z %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4151384Z %12 = arith.extsi %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4152046Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:54:05.4152749Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:54:05.4153233Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:54:05.4153697Z %16 = arith.cmpi eq, %15, %cst_11 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:54:05.4154058Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:54:05.4154408Z %18 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:54:05.4154755Z %19 = 
tt.broadcast %18 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1> 2026-02-21T09:54:05.4155011Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:05.4155184Z %21 = arith.subi %2, %0 : i32 2026-02-21T09:54:05.4155340Z %22 = arith.remsi %21, %c3_i32 : i32 2026-02-21T09:54:05.4155546Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:54:05.4155740Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:54:05.4155957Z scf.for %arg3 = %0 to %24 step %c3_i32 : i32 { 2026-02-21T09:54:05.4156196Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:54:05.4156414Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:54:05.4156619Z %27 = arith.subi %c64_i32, %26 : i32 2026-02-21T09:54:05.4156856Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:54:05.4157063Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:54:05.4157277Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:54:05.4157476Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:54:05.4157675Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:54:05.4157870Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:54:05.4158149Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4158555Z %35 = arith.addi %34, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4158838Z %36 = arith.muli %32, %c128_i32 : i32 2026-02-21T09:54:05.4159133Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4159489Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4159828Z %39 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4160170Z %40 = arith.addi %38, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4160648Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 
2026-02-21T09:54:05.4161054Z %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4161367Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4161648Z %44 = arith.extsi %33 : i32 to i64 2026-02-21T09:54:05.4161908Z %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4162260Z %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4162803Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4163255Z %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4163478Z %49 = arith.cmpi sge, %47, %cst_4 : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4163646Z %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4163805Z %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked> 2026-02-21T09:54:05.4164094Z %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4164443Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4164887Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4165332Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4165584Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4165782Z %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4165980Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4166214Z %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 
2026-02-21T09:54:05.4166473Z %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4166653Z %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4166837Z %62 = arith.addi %61, %48 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4167025Z %63 = tt.addptr %9, %62 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4167229Z %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4167395Z %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4167568Z %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4167744Z %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4167927Z %68 = arith.andi %67, %52 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4168133Z %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4168491Z %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4168850Z ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4169124Z %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4169400Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4169678Z %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4169869Z %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4170064Z %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4170266Z %76 = tt.load %75 : tensor<128x4x!tt.ptr, #blocked2> 
2026-02-21T09:54:05.4170451Z %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4170722Z %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4170956Z %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4171151Z %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4171335Z %81 = arith.addi %80, %48 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4171520Z %82 = tt.addptr %9, %81 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4171722Z %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4171885Z %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4172044Z %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4172219Z %86 = tt.broadcast %85 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4172400Z %87 = arith.andi %86, %52 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4172607Z %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4172942Z %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4173302Z ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4173926Z %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 
2026-02-21T09:54:05.4174465Z %317 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:05.4174592Z %318 = arith.muli %317, %c2_i32 : i32 2026-02-21T09:54:05.4174766Z %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4174990Z %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4175269Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4175574Z %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4175789Z %323 = arith.addi %43, %322 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4175996Z %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4176204Z %325 = tt.load %324 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4176524Z %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4176965Z %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4177251Z %328 = arith.extsi %317 : i32 to i64 2026-02-21T09:54:05.4177421Z %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4177642Z %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4177915Z %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4178157Z %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4178356Z %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 
2026-02-21T09:54:05.4178548Z %334 = arith.addi %333, %48 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4178747Z %335 = tt.addptr %9, %334 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4178977Z %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4179145Z %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4179313Z %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4179500Z %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4179689Z %340 = arith.andi %339, %52 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4179855Z %341 = tt.load %335, %340, %cst_1 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4180031Z %342 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4180191Z %343 = arith.shrsi %342, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4180436Z %344 = ttg.convert_layout %343 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4180688Z %345 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4180935Z %346 = ttg.convert_layout %345 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4181273Z %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4181618Z %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4181922Z %349 = tt.broadcast %347 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4182172Z %350 = arith.select %17, %349, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4182418Z %351 = tt.broadcast %348 : tensor<2x1x128xi8, 
#blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4182662Z %352 = arith.select %19, %351, %350 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4182900Z %353 = tt.reshape %352 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4183145Z %354 = arith.sitofp %353 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4183402Z %355 = ttg.local_alloc %354 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4183731Z %356 = ttg.local_load %355 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4184225Z %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4184575Z %358 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:05.4184704Z %359 = arith.cmpi slt, %358, %c2_i32 : i32 2026-02-21T09:54:05.4184840Z %360 = arith.select %359, %358, %c0_i32 : i32 2026-02-21T09:54:05.4185110Z %361 = ttg.memdesc_index %53[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4185475Z ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4185969Z scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4186398Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:05.4186716Z %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, 
#smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4187162Z %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4187459Z %93 = arith.shli %90#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4187620Z %94 = arith.shrsi %93, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4187859Z %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4188099Z %96 = arith.shrsi %90#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4188339Z %97 = ttg.convert_layout %96 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4188670Z %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4189012Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4189299Z %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4189544Z %101 = arith.select %17, %100, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4189788Z %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4190038Z %103 = arith.select %19, %102, %101 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4190280Z %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4190506Z %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4190758Z %106 = ttg.local_alloc %105 : (tensor<4x128xf32, 
#blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4191086Z %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4191570Z %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4192061Z %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4192508Z %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4192807Z %111 = arith.shli %90#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4192968Z %112 = arith.shrsi %111, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4193216Z %113 = ttg.convert_layout %112 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4193471Z %114 = arith.shrsi %90#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4193716Z %115 = ttg.convert_layout %114 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4194054Z %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4194400Z %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4194689Z %118 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4194936Z %119 = arith.select %17, %118, %cst_0 : tensor<2x2x128xi1, 
#blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4195205Z %120 = tt.broadcast %117 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4195444Z %121 = arith.select %19, %120, %119 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4195683Z %122 = tt.reshape %121 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4195911Z %123 = arith.sitofp %122 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4196164Z %124 = ttg.local_alloc %123 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4196491Z %125 = ttg.local_load %124 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4196966Z %126 = tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4197350Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4197568Z %127 = arith.truncf %126 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:54:05.4197839Z %128 = tt.expand_dims %40 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4198095Z %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4198331Z %130 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:54:05.4198591Z %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4198801Z %132 = tt.broadcast %130 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4198986Z %133 = arith.addi %131, %132 : tensor<128x128xi32, 
#mma> 2026-02-21T09:54:05.4199184Z %134 = tt.addptr %20, %133 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4199407Z tt.store %134, %127 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:05.4199550Z %135 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:54:05.4199675Z %136 = arith.divsi %135, %c512_i32 : i32 2026-02-21T09:54:05.4199797Z %137 = arith.muli %136, %c4_i32 : i32 2026-02-21T09:54:05.4199920Z %138 = arith.subi %c64_i32, %137 : i32 2026-02-21T09:54:05.4200035Z %139 = arith.minsi %138, %c4_i32 : i32 2026-02-21T09:54:05.4200153Z %140 = arith.remsi %135, %c512_i32 : i32 2026-02-21T09:54:05.4200269Z %141 = arith.remsi %140, %139 : i32 2026-02-21T09:54:05.4200397Z %142 = arith.addi %137, %141 : i32 2026-02-21T09:54:05.4200514Z %143 = arith.divsi %140, %139 : i32 2026-02-21T09:54:05.4200630Z %144 = arith.muli %142, %c128_i32 : i32 2026-02-21T09:54:05.4200799Z %145 = tt.splat %144 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4201009Z %146 = arith.addi %145, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4201182Z %147 = arith.muli %143, %c128_i32 : i32 2026-02-21T09:54:05.4201350Z %148 = tt.splat %147 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4201571Z %149 = tt.splat %147 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4201789Z %150 = arith.addi %148, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4202002Z %151 = arith.addi %149, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4202277Z %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4202532Z %153 = arith.muli %152, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4202808Z %154 = tt.broadcast %153 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, 
#blocked2> 2026-02-21T09:54:05.4202995Z %155 = arith.extsi %144 : i32 to i64 2026-02-21T09:54:05.4203161Z %156 = tt.splat %155 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4203384Z %157 = arith.addi %156, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4203658Z %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4203939Z %159 = tt.broadcast %158 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4204143Z %160 = arith.cmpi sge, %158, %cst_4 : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4204316Z %161 = arith.cmpi slt, %158, %cst_3 : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4204483Z %162 = arith.andi %160, %161 : tensor<1x128xi1, #blocked> 2026-02-21T09:54:05.4204669Z %163 = tt.broadcast %162 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4204884Z %164 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4205070Z %165 = arith.addi %154, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4205271Z %166 = tt.addptr %8, %165 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4205507Z %167 = tt.load %166 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4205664Z %168 = arith.addi %61, %159 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4205858Z %169 = tt.addptr %9, %168 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4206054Z %170 = arith.andi %67, %163 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4206270Z %171 = tt.load %169, %170, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4206614Z %172 = ttg.memdesc_index %164[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 
2026-02-21T09:54:05.4206997Z ttg.local_store %167, %172 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4207239Z %173 = arith.addi %154, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4207440Z %174 = tt.addptr %8, %173 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4207647Z %175 = tt.load %174 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4207805Z %176 = arith.addi %80, %159 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4208012Z %177 = tt.addptr %9, %176 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4208208Z %178 = arith.andi %86, %163 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4208418Z %179 = tt.load %177, %178, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4208756Z %180 = ttg.memdesc_index %164[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4209120Z ttg.local_store %175, %180 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4209762Z %181:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %172, %arg8 = %180, %arg9 = %171, %arg10 = %179) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:54:05.4210292Z %317 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:05.4210418Z %318 = arith.muli %317, %c2_i32 : i32 2026-02-21T09:54:05.4210606Z %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4210834Z %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T09:54:05.4211109Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4211388Z %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4211586Z %323 = arith.addi %154, %322 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4211788Z %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4211999Z %325 = tt.load %324 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4212298Z %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4212738Z %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4213020Z %328 = arith.extsi %317 : i32 to i64 2026-02-21T09:54:05.4213188Z %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4213407Z %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4213694Z %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4213941Z %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4214132Z %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4214322Z %334 = arith.addi %333, %159 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4214520Z %335 = tt.addptr %9, %334 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4214723Z %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4214913Z %337 = arith.cmpi slt, %331, %cst_5 : 
tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4215074Z %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4215261Z %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4215453Z %340 = arith.andi %339, %163 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4215621Z %341 = tt.load %335, %340, %cst_1 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4215811Z %342 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4215971Z %343 = arith.shrsi %342, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4216218Z %344 = ttg.convert_layout %343 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4216469Z %345 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4216714Z %346 = ttg.convert_layout %345 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4217054Z %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4217398Z %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4217695Z %349 = tt.broadcast %347 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4217947Z %350 = arith.select %17, %349, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4218190Z %351 = tt.broadcast %348 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4218451Z %352 = arith.select %19, %351, %350 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4218686Z %353 = tt.reshape %352 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4218914Z %354 = arith.sitofp %353 : tensor<4x128xi8, 
#blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4219171Z %355 = ttg.local_alloc %354 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4219503Z %356 = ttg.local_load %355 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4219989Z %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4220343Z %358 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:05.4220471Z %359 = arith.cmpi slt, %358, %c2_i32 : i32 2026-02-21T09:54:05.4220608Z %360 = arith.select %359, %358, %c0_i32 : i32 2026-02-21T09:54:05.4220876Z %361 = ttg.memdesc_index %164[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4221238Z ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4221742Z scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4222165Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:05.4222486Z %182 = ttg.local_load %181#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4222939Z %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4223239Z %184 = arith.shli %181#4, %cst : 
tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4223405Z %185 = arith.shrsi %184, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4223646Z %186 = ttg.convert_layout %185 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4223906Z %187 = arith.shrsi %181#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4224150Z %188 = ttg.convert_layout %187 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4224488Z %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4224838Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4225127Z %191 = tt.broadcast %189 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4225377Z %192 = arith.select %17, %191, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4225626Z %193 = tt.broadcast %190 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4225867Z %194 = arith.select %19, %193, %192 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4226106Z %195 = tt.reshape %194 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4226329Z %196 = arith.sitofp %195 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4226600Z %197 = ttg.local_alloc %196 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4226929Z %198 = ttg.local_load %197 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4227401Z %199 = tt.dot %183, %198, %181#0, inputPrecision = tf32 : tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4227899Z %200 = ttg.local_load %181#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4228330Z %201 = arith.extf %200 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4228632Z %202 = arith.shli %181#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4228796Z %203 = arith.shrsi %202, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4229041Z %204 = ttg.convert_layout %203 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4229291Z %205 = arith.shrsi %181#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4229534Z %206 = ttg.convert_layout %205 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4229882Z %207 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4230224Z %208 = tt.expand_dims %206 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4230511Z %209 = tt.broadcast %207 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4230758Z %210 = arith.select %17, %209, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4231022Z %211 = tt.broadcast %208 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4231261Z %212 = arith.select %19, %211, %210 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4231497Z %213 = tt.reshape 
%212 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4231728Z %214 = arith.sitofp %213 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4232001Z %215 = ttg.local_alloc %214 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4232327Z %216 = ttg.local_load %215 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4232798Z %217 = tt.dot %201, %216, %199, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4233183Z ttg.local_dealloc %164 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4233401Z %218 = arith.truncf %217 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:54:05.4233672Z %219 = tt.expand_dims %151 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4233916Z %220 = arith.muli %219, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4234151Z %221 = tt.expand_dims %146 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:54:05.4234414Z %222 = tt.broadcast %220 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4234647Z %223 = tt.broadcast %221 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4234828Z %224 = arith.addi %222, %223 : tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4235023Z %225 = tt.addptr %20, %224 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4235222Z tt.store %225, %218 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:05.4235366Z %226 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:54:05.4235489Z %227 = arith.divsi %226, %c512_i32 : 
i32 2026-02-21T09:54:05.4235612Z %228 = arith.muli %227, %c4_i32 : i32 2026-02-21T09:54:05.4235733Z %229 = arith.subi %c64_i32, %228 : i32 2026-02-21T09:54:05.4235852Z %230 = arith.minsi %229, %c4_i32 : i32 2026-02-21T09:54:05.4235972Z %231 = arith.remsi %226, %c512_i32 : i32 2026-02-21T09:54:05.4236089Z %232 = arith.remsi %231, %230 : i32 2026-02-21T09:54:05.4236210Z %233 = arith.addi %228, %232 : i32 2026-02-21T09:54:05.4236326Z %234 = arith.divsi %231, %230 : i32 2026-02-21T09:54:05.4236451Z %235 = arith.muli %233, %c128_i32 : i32 2026-02-21T09:54:05.4236618Z %236 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4236838Z %237 = arith.addi %236, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4237015Z %238 = arith.muli %234, %c128_i32 : i32 2026-02-21T09:54:05.4237189Z %239 = tt.splat %238 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4237427Z %240 = tt.splat %238 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4237646Z %241 = arith.addi %239, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4237866Z %242 = arith.addi %240, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4238148Z %243 = tt.expand_dims %241 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4238406Z %244 = arith.muli %243, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4238630Z %245 = tt.broadcast %244 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4238810Z %246 = arith.extsi %235 : i32 to i64 2026-02-21T09:54:05.4238985Z %247 = tt.splat %246 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4239213Z %248 = arith.addi %247, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4239506Z 
%249 = tt.expand_dims %248 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4239791Z %250 = tt.broadcast %249 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4239996Z %251 = arith.cmpi sge, %249, %cst_4 : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4240180Z %252 = arith.cmpi slt, %249, %cst_3 : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4240353Z %253 = arith.andi %251, %252 : tensor<1x128xi1, #blocked> 2026-02-21T09:54:05.4240543Z %254 = tt.broadcast %253 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4240765Z %255 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4240955Z %256 = arith.addi %245, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4241166Z %257 = tt.addptr %8, %256 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4241375Z %258 = tt.load %257 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4241541Z %259 = arith.addi %61, %250 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4241741Z %260 = tt.addptr %9, %259 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4241940Z %261 = arith.andi %67, %254 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4242176Z %262 = tt.load %260, %261, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4242520Z %263 = ttg.memdesc_index %255[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4242916Z ttg.local_store %258, %263 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4243168Z %264 = arith.addi %245, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4243371Z %265 = tt.addptr %8, %264 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, 
#blocked2> 2026-02-21T09:54:05.4243586Z %266 = tt.load %265 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4243747Z %267 = arith.addi %80, %250 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4243945Z %268 = tt.addptr %9, %267 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4244150Z %269 = arith.andi %86, %254 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4244363Z %270 = tt.load %268, %269, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4244712Z %271 = ttg.memdesc_index %255[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4245080Z ttg.local_store %266, %271 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4245744Z %272:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %263, %arg8 = %271, %arg9 = %262, %arg10 = %270) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:54:05.4246284Z %317 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:05.4246411Z %318 = arith.muli %317, %c2_i32 : i32 2026-02-21T09:54:05.4246611Z %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4246841Z %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4247121Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4247409Z %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4247625Z %323 = arith.addi %245, %322 : tensor<128x4xi32, #blocked2> 
2026-02-21T09:54:05.4247834Z %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4248046Z %325 = tt.load %324 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4248353Z %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4248803Z %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4249088Z %328 = arith.extsi %317 : i32 to i64 2026-02-21T09:54:05.4249268Z %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4249494Z %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4249774Z %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4250029Z %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4250243Z %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4250444Z %334 = arith.addi %333, %250 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4250646Z %335 = tt.addptr %9, %334 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4250855Z %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4251032Z %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4251198Z %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4251388Z %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4251582Z %340 = arith.andi %339, %254 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4251756Z %341 = tt.load %335, %340, %cst_1 : 
tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4251938Z %342 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4252100Z %343 = arith.shrsi %342, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4252351Z %344 = ttg.convert_layout %343 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4252606Z %345 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4252861Z %346 = ttg.convert_layout %345 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4253222Z %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4253570Z %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4253871Z %349 = tt.broadcast %347 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4254130Z %350 = arith.select %17, %349, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4254383Z %351 = tt.broadcast %348 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4254651Z %352 = arith.select %19, %351, %350 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4254892Z %353 = tt.reshape %352 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4255128Z %354 = arith.sitofp %353 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4255387Z %355 = ttg.local_alloc %354 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4255743Z %356 = ttg.local_load %355 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4256228Z 
%357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4256579Z %358 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:05.4256714Z %359 = arith.cmpi slt, %358, %c2_i32 : i32 2026-02-21T09:54:05.4256854Z %360 = arith.select %359, %358, %c0_i32 : i32 2026-02-21T09:54:05.4257126Z %361 = ttg.memdesc_index %255[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4257496Z ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4257986Z scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4258431Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:05.4258756Z %273 = ttg.local_load %272#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4259192Z %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4259502Z %275 = arith.shli %272#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4259670Z %276 = arith.shrsi %275, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4259918Z %277 = ttg.convert_layout %276 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4260172Z %278 = arith.shrsi %272#4, %cst : tensor<2x128xi8, #blocked> 
2026-02-21T09:54:05.4260417Z %279 = ttg.convert_layout %278 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4260760Z %280 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4261113Z %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4261420Z %282 = tt.broadcast %280 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4261675Z %283 = arith.select %17, %282, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4261924Z %284 = tt.broadcast %281 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4262171Z %285 = arith.select %19, %284, %283 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4262418Z %286 = tt.reshape %285 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4262650Z %287 = arith.sitofp %286 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4262924Z %288 = ttg.local_alloc %287 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4263250Z %289 = ttg.local_load %288 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4263751Z %290 = tt.dot %274, %289, %272#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4264254Z %291 = ttg.local_load %272#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:54:05.4264689Z %292 = arith.extf %291 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4264998Z %293 = arith.shli %272#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4265168Z %294 = arith.shrsi %293, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4265413Z %295 = ttg.convert_layout %294 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4265668Z %296 = arith.shrsi %272#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4265914Z %297 = ttg.convert_layout %296 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4266259Z %298 = tt.expand_dims %295 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4266627Z %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4266918Z %300 = tt.broadcast %298 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4267175Z %301 = arith.select %17, %300, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4267423Z %302 = tt.broadcast %299 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4267670Z %303 = arith.select %19, %302, %301 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4267911Z %304 = tt.reshape %303 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4268141Z %305 = arith.sitofp %304 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4268403Z %306 = ttg.local_alloc %305 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4268736Z 
%307 = ttg.local_load %306 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4269214Z %308 = tt.dot %292, %307, %290, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4269603Z ttg.local_dealloc %255 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4269837Z %309 = arith.truncf %308 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:54:05.4270115Z %310 = tt.expand_dims %242 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4270363Z %311 = arith.muli %310, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4270599Z %312 = tt.expand_dims %237 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:54:05.4270869Z %313 = tt.broadcast %311 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4271101Z %314 = tt.broadcast %312 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4271292Z %315 = arith.addi %313, %314 : tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4271494Z %316 = tt.addptr %20, %315 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4271701Z tt.store %316, %309 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:05.4271848Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:05.4271971Z scf.for %arg3 = %24 to %2 step %c1_i32 : i32 { 2026-02-21T09:54:05.4272128Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:54:05.4272252Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:54:05.4272379Z %27 = arith.subi %c64_i32, %26 : i32 2026-02-21T09:54:05.4272498Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:54:05.4272624Z %29 = arith.remsi %arg3, %c512_i32 : i32 
2026-02-21T09:54:05.4272748Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:54:05.4272865Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:54:05.4272982Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:54:05.4273096Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:54:05.4273262Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4273473Z %35 = arith.addi %34, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:05.4273643Z %36 = arith.muli %32, %c128_i32 : i32 2026-02-21T09:54:05.4273816Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4274028Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4274246Z %39 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:05.4274476Z %40 = arith.addi %38, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:05.4274751Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4275006Z %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:05.4275196Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4275373Z %44 = arith.extsi %33 : i32 to i64 2026-02-21T09:54:05.4275537Z %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4275754Z %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:54:05.4276023Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4276297Z %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4276493Z %49 = arith.cmpi sge, %47, %cst_4 : 
tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4276661Z %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x128xi64, #blocked> 2026-02-21T09:54:05.4276824Z %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked> 2026-02-21T09:54:05.4277000Z %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4277229Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4277495Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4277761Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4277950Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4278144Z %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4278345Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4278601Z %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4278836Z %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4279018Z %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4279203Z %62 = arith.addi %61, %48 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4279392Z %63 = tt.addptr %9, %62 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4279606Z %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4279771Z %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4279928Z %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4280102Z %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4280286Z %68 = arith.andi %67, %52 : tensor<2x128xi1, #blocked> 
2026-02-21T09:54:05.4280493Z %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4280831Z %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4281195Z ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4281468Z %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4281745Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4282030Z %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4282219Z %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4282417Z %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4282658Z %76 = tt.load %75 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4282847Z %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4283115Z %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4283352Z %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4283534Z %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4283718Z %81 = arith.addi %80, %48 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4283909Z %82 = tt.addptr %9, %81 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4284107Z %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4284273Z %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked> 
2026-02-21T09:54:05.4284429Z %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4284604Z %86 = tt.broadcast %85 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4284806Z %87 = arith.andi %86, %52 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4285008Z %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4285341Z %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4285699Z ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4286324Z %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>) : i32 { 2026-02-21T09:54:05.4286861Z %135 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:54:05.4286987Z %136 = arith.muli %135, %c2_i32 : i32 2026-02-21T09:54:05.4287161Z %137 = tt.splat %136 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4287405Z %138 = arith.addi %137, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:05.4287679Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:05.4287957Z %140 = tt.broadcast %139 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4288154Z %141 = arith.addi %43, %140 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:05.4288360Z %142 = tt.addptr %8, %141 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 
2026-02-21T09:54:05.4288569Z %143 = tt.load %142 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:05.4288869Z %144 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4289309Z %145 = arith.extf %144 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4289590Z %146 = arith.extsi %135 : i32 to i64 2026-02-21T09:54:05.4289762Z %147 = tt.splat %146 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4290002Z %148 = arith.addi %147, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:05.4290273Z %149 = tt.expand_dims %148 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4290521Z %150 = arith.muli %149, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4290710Z %151 = tt.broadcast %150 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4290904Z %152 = arith.addi %151, %48 : tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4291100Z %153 = tt.addptr %9, %152 : tensor<2x128x!tt.ptr, #blocked>, tensor<2x128xi64, #blocked> 2026-02-21T09:54:05.4291307Z %154 = arith.cmpi sge, %149, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4291480Z %155 = arith.cmpi slt, %149, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:54:05.4291642Z %156 = arith.andi %154, %155 : tensor<2x1xi1, #blocked> 2026-02-21T09:54:05.4291833Z %157 = tt.broadcast %156 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4292025Z %158 = arith.andi %157, %52 : tensor<2x128xi1, #blocked> 2026-02-21T09:54:05.4292193Z %159 = tt.load %153, %158, %cst_1 : tensor<2x128x!tt.ptr, #blocked> 2026-02-21T09:54:05.4292370Z %160 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked> 
2026-02-21T09:54:05.4292531Z %161 = arith.shrsi %160, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4292794Z %162 = ttg.convert_layout %161 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4293043Z %163 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4293293Z %164 = ttg.convert_layout %163 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4293635Z %165 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4293980Z %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4294297Z %167 = tt.broadcast %165 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4294550Z %168 = arith.select %17, %167, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4294796Z %169 = tt.broadcast %166 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4295058Z %170 = arith.select %19, %169, %168 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4295293Z %171 = tt.reshape %170 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4295533Z %172 = arith.sitofp %171 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4295788Z %173 = ttg.local_alloc %172 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4296120Z %174 = ttg.local_load %173 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4296598Z %175 = tt.dot %145, %174, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, 
kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4296949Z %176 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:05.4297077Z %177 = arith.cmpi slt, %176, %c2_i32 : i32 2026-02-21T09:54:05.4297211Z %178 = arith.select %177, %176, %c0_i32 : i32 2026-02-21T09:54:05.4297477Z %179 = ttg.memdesc_index %53[%178] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4297858Z ttg.local_store %143, %179 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:54:05.4298350Z scf.yield %175, %178, %arg8, %179, %arg10, %159 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4298774Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:05.4299092Z %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4299521Z %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4299817Z %93 = arith.shli %90#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4299976Z %94 = arith.shrsi %93, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4300214Z %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4300454Z %96 = arith.shrsi %90#4, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4300691Z %97 = ttg.convert_layout %96 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:54:05.4301036Z %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4301376Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4301661Z %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4301904Z %101 = arith.select %17, %100, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4302147Z %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4302402Z %103 = arith.select %19, %102, %101 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4302640Z %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4302869Z %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4303134Z %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4303462Z %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4303933Z %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4304429Z %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4304856Z %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:54:05.4305154Z %111 = arith.shli %90#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4305316Z %112 = arith.shrsi %111, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4305561Z %113 = ttg.convert_layout %112 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4305807Z %114 = arith.shrsi %90#5, %cst : tensor<2x128xi8, #blocked> 2026-02-21T09:54:05.4306065Z %115 = ttg.convert_layout %114 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:05.4306401Z %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4306745Z %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1> 2026-02-21T09:54:05.4307036Z %118 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4307281Z %119 = arith.select %17, %118, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4307529Z %120 = tt.broadcast %117 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4307769Z %121 = arith.select %19, %120, %119 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1> 2026-02-21T09:54:05.4308006Z %122 = tt.reshape %121 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:05.4308233Z %123 = arith.sitofp %122 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:05.4308487Z %124 = ttg.local_alloc %123 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:05.4308813Z %125 = ttg.local_load %124 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:05.4309295Z %126 = 
tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:05.4309682Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:54:05.4309901Z %127 = arith.truncf %126 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:54:05.4310174Z %128 = tt.expand_dims %40 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4310429Z %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:54:05.4310667Z %130 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:54:05.4310932Z %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4311140Z %132 = tt.broadcast %130 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4311336Z %133 = arith.addi %131, %132 : tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4311532Z %134 = tt.addptr %20, %133 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:54:05.4311731Z tt.store %134, %127 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:05.4311871Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:05.4311975Z tt.return 2026-02-21T09:54:05.4312053Z } 2026-02-21T09:54:05.4312130Z } 2026-02-21T09:54:05.4312174Z 2026-02-21T09:54:05.4312206Z {-# 2026-02-21T09:54:05.4312287Z external_resources: { 2026-02-21T09:54:05.4312385Z mlir_reproducer: { 2026-02-21T09:54:05.4313378Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:54:05.4314362Z disable_threading: false, 2026-02-21T09:54:05.4314503Z verify_each: true 2026-02-21T09:54:05.4314597Z } 2026-02-21T09:54:05.4314667Z } 2026-02-21T09:54:05.4314738Z #-} 2026-02-21T09:54:05.4315016Z /tmp/torchinductor_root/sm/csm3u5qpmzobx27atwavnzywqgcr3ileu4vkrdd2sfjqwnv35b46.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:54:05.4315702Z /tmp/torchinductor_root/sm/csm3u5qpmzobx27atwavnzywqgcr3ileu4vkrdd2sfjqwnv35b46.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:54:05.4316250Z [576s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:54:05.4317024Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:54:05.4317738Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:54:05.4317908Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:54:07.7872981Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:54:07.7876742Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:54:07.7877433Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:54:07.7878098Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:54:07.7878688Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:54:07.7879322Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:07.7879798Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:07.7880155Z #smem = #ttg.shared_memory 2026-02-21T09:54:07.7880610Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:54:07.7881657Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:54:07.7882427Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:54:07.7882846Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:07.7883188Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:07.7883539Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:07.7883915Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:54:07.7884284Z 
%cst_4 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:54:07.7884632Z %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:54:07.7884974Z %cst_6 = arith.constant dense<512> : tensor<2x1xi64, #blocked2> 2026-02-21T09:54:07.7885318Z %cst_7 = arith.constant dense<0> : tensor<1x128xi64, #blocked2> 2026-02-21T09:54:07.7885670Z %cst_8 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2> 2026-02-21T09:54:07.7895224Z %cst_9 = arith.constant dense<0> : tensor<2x128xi8, #blocked2> 2026-02-21T09:54:07.7895537Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:54:07.7895924Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:54:07.7896206Z %cst_10 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:07.7896511Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:54:07.7896737Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:54:07.7897093Z %cst_11 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:07.7897479Z %0 = tt.get_program_id x : i32 2026-02-21T09:54:07.7897692Z %1 = arith.divsi %0, %c128_i32 : i32 2026-02-21T09:54:07.7897916Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:54:07.7898133Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:54:07.7898359Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:54:07.7898576Z %5 = arith.remsi %0, %c128_i32 : i32 2026-02-21T09:54:07.7898798Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:54:07.7899007Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:54:07.7899208Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:54:07.7899418Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:54:07.7899816Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:07.7900386Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:07.7900924Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:07.7901513Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:07.7902035Z %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:07.7902461Z %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:07.7902887Z %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:07.7903315Z %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:07.7903693Z %18 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:54:07.7904002Z %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:07.7904405Z %20 = arith.addi %19, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:07.7904891Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:07.7905522Z %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:07.7906055Z %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:07.7906443Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:07.7906874Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:54:07.7907193Z %26 = arith.extsi %18 : i32 to i64 2026-02-21T09:54:07.7907490Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:54:07.7907961Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:07.7908616Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent 
= #blocked2}>> 2026-02-21T09:54:07.7909204Z %30 = tt.splat %26 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:07.7909809Z %31 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:07.7910411Z %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:07.7911001Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:54:07.7911565Z %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:54:07.7911968Z %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x128xi64, #blocked2> 2026-02-21T09:54:07.7912307Z %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x128xi64, #blocked2> 2026-02-21T09:54:07.7912627Z %37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked2> 2026-02-21T09:54:07.7912990Z %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:54:07.7913576Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:54:07.7914427Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:54:07.7915259Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:07.7915770Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:07.7916150Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:54:07.7916541Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:07.7916942Z %45 = tt.broadcast %44 : 
tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:54:07.7917472Z %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst_3) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:54:07.7917905Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:54:07.7918233Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:07.7918669Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:07.7919218Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:54:07.7919796Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:07.7920176Z %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1> 2026-02-21T09:54:07.7920564Z %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:54:07.7920969Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:54:07.7921423Z %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:54:07.7922102Z %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:07.7923001Z %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:07.7923582Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:54:07.7923911Z %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:07.7924341Z %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:07.7924890Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> 
tensor<2x1xi64, #blocked2> 2026-02-21T09:54:07.7925384Z %71 = arith.muli %70, %cst_4 : tensor<2x1xi64, #blocked2> 2026-02-21T09:54:07.7925757Z %72 = tt.broadcast %71 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:54:07.7926134Z %73 = arith.addi %72, %34 : tensor<2x128xi64, #blocked2> 2026-02-21T09:54:07.7926518Z %74 = tt.addptr %27, %73 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:54:07.7926964Z %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #blocked2> 2026-02-21T09:54:07.7927299Z %76 = arith.cmpi slt, %70, %cst_6 : tensor<2x1xi64, #blocked2> 2026-02-21T09:54:07.7927616Z %77 = arith.andi %75, %76 : tensor<2x1xi1, #blocked2> 2026-02-21T09:54:07.7927977Z %78 = tt.broadcast %77 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:54:07.7928346Z %79 = arith.andi %78, %38 : tensor<2x128xi1, #blocked2> 2026-02-21T09:54:07.7928671Z %80 = tt.load %74, %79, %cst_9 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:54:07.7929180Z %81 = ttg.convert_layout %80 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:07.7929747Z %82 = arith.shli %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:07.7930225Z %83 = arith.shrsi %82, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:07.7930700Z %84 = arith.shrsi %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:07.7931294Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:54:07.7931976Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:54:07.7932583Z %87 = tt.broadcast %85 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:07.7933066Z %88 = arith.select %43, 
%87, %cst_10 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:07.7933543Z %89 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:07.7934009Z %90 = arith.select %45, %89, %88 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:07.7934464Z %91 = tt.reshape %90 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:54:07.7934909Z %92 = arith.sitofp %91 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:54:07.7935451Z %93 = ttg.local_alloc %92 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:07.7936120Z %94 = ttg.local_load %93 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:07.7937141Z %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:07.7937872Z scf.yield %95 : tensor<128x128xf32, #mma> 2026-02-21T09:54:07.7938249Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32} 2026-02-21T09:54:07.7938718Z %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:54:07.7939252Z %48 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:07.7939736Z %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:54:07.7940198Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:54:07.7940725Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:07.7941137Z %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, 
#mma> 2026-02-21T09:54:07.7941491Z %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma> 2026-02-21T09:54:07.7941839Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:07.7942271Z %55 = tt.addptr %54, %53 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:54:07.7942684Z tt.store %55, %47 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:07.7942944Z tt.return 2026-02-21T09:54:07.7943096Z } 2026-02-21T09:54:07.7943246Z } 2026-02-21T09:54:07.7943328Z 2026-02-21T09:54:07.7943386Z {-# 2026-02-21T09:54:07.7943540Z external_resources: { 2026-02-21T09:54:07.7943726Z mlir_reproducer: { 2026-02-21T09:54:07.7945860Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:54:07.7948007Z disable_threading: false, 2026-02-21T09:54:07.7948220Z verify_each: true 2026-02-21T09:54:07.7948391Z } 2026-02-21T09:54:07.7948529Z } 2026-02-21T09:54:07.7948660Z #-} 2026-02-21T09:54:07.7949246Z /tmp/torchinductor_root/nf/cnf5yzghtklydrgpfrfik7a4ewtelw3rmct3ccelbte66osh63kw.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:54:07.7950775Z /tmp/torchinductor_root/nf/cnf5yzghtklydrgpfrfik7a4ewtelw3rmct3ccelbte66osh63kw.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 
2026-02-21T09:54:07.7952036Z [578s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:54:07.7953635Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:54:07.7955119Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:54:07.7955462Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:54:08.4809691Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:54:08.4811602Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:54:08.4812303Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:54:08.4812973Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:54:08.4813633Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:54:08.4814257Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:54:08.4814817Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:54:08.4815330Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:08.4815726Z #smem = #ttg.shared_memory 2026-02-21T09:54:08.4816237Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:54:08.4817415Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:54:08.4818239Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:54:08.4818627Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:08.4818993Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:08.4819298Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:54:08.4819612Z %cst_3 = arith.constant dense<8192> : tensor<2x1xi64, #blocked1> 2026-02-21T09:54:08.4819900Z %cst_4 = arith.constant dense<0> : tensor<2x1xi64, #blocked1> 2026-02-21T09:54:08.4820180Z 
%cst_5 = arith.constant dense<512> : tensor<2x1xi64, #blocked1> 2026-02-21T09:54:08.4820470Z %cst_6 = arith.constant dense<0> : tensor<1x128xi64, #blocked1> 2026-02-21T09:54:08.4820763Z %cst_7 = arith.constant dense<8192> : tensor<1x128xi64, #blocked1> 2026-02-21T09:54:08.4821054Z %cst_8 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:08.4821306Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:54:08.4821507Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:54:08.4821745Z %cst_9 = arith.constant dense<0> : tensor<2x128xi8, #blocked1> 2026-02-21T09:54:08.4822033Z %cst_10 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:08.4822277Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:54:08.4822471Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:54:08.4822819Z %cst_11 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:08.4823135Z %0 = tt.get_program_id x : i32 2026-02-21T09:54:08.4823315Z %1 = arith.divsi %0, %c128_i32 : i32 2026-02-21T09:54:08.4823504Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:54:08.4823692Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:54:08.4823872Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:54:08.4824061Z %5 = arith.remsi %0, %c128_i32 : i32 2026-02-21T09:54:08.4824239Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:54:08.4824422Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:54:08.4824588Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:54:08.4824847Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:54:08.4825189Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:08.4825652Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:08.4826113Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:08.4826587Z 
%13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:08.4827003Z %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:08.4827356Z %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:08.4827710Z %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:54:08.4828064Z %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:08.4828329Z %18 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:54:08.4828591Z %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:08.4828929Z %20 = arith.addi %19, %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:08.4829324Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:08.4829852Z %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:54:08.4830165Z %23 = arith.muli %22, %cst_8 : tensor<128x1xi32, #blocked2> 2026-02-21T09:54:08.4830429Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:08.4830698Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:08.4830928Z %26 = arith.extsi %18 : i32 to i64 2026-02-21T09:54:08.4831117Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:54:08.4831406Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:08.4831810Z %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:08.4832176Z %30 = tt.splat %26 : i64 -> tensor<128xi64, 
#ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:08.4832550Z %31 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:08.4832921Z %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:08.4833268Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi64, #blocked1> 2026-02-21T09:54:08.4833613Z %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked1> -> tensor<2x128xi64, #blocked1> 2026-02-21T09:54:08.4833865Z %35 = arith.cmpi sge, %33, %cst_6 : tensor<1x128xi64, #blocked1> 2026-02-21T09:54:08.4834101Z %36 = arith.cmpi slt, %33, %cst_7 : tensor<1x128xi64, #blocked1> 2026-02-21T09:54:08.4834302Z %37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked1> 2026-02-21T09:54:08.4834528Z %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked1> -> tensor<2x128xi1, #blocked1> 2026-02-21T09:54:08.4834883Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:54:08.4835404Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:54:08.4835935Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:08.4836253Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:08.4836498Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:54:08.4836739Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:08.4836976Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:54:08.4837327Z %46 = 
scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst_2) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:54:08.4837594Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:54:08.4837810Z %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:08.4838085Z %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:54:08.4838422Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:54:08.4838768Z %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:54:08.4839059Z %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked2> 2026-02-21T09:54:08.4839260Z %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:54:08.4839468Z %63 = tt.load %62 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:54:08.4839690Z %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:54:08.4840041Z %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:08.4840451Z %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:08.4840740Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:54:08.4840910Z %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:08.4841125Z %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:08.4841399Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi64, #blocked1> 2026-02-21T09:54:08.4841642Z %71 = arith.muli %70, %cst_3 : 
tensor<2x1xi64, #blocked1> 2026-02-21T09:54:08.4841831Z %72 = tt.broadcast %71 : tensor<2x1xi64, #blocked1> -> tensor<2x128xi64, #blocked1> 2026-02-21T09:54:08.4842019Z %73 = arith.addi %72, %34 : tensor<2x128xi64, #blocked1> 2026-02-21T09:54:08.4842219Z %74 = tt.addptr %27, %73 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi64, #blocked1> 2026-02-21T09:54:08.4842430Z %75 = arith.cmpi sge, %70, %cst_4 : tensor<2x1xi64, #blocked1> 2026-02-21T09:54:08.4842700Z %76 = arith.cmpi slt, %70, %cst_5 : tensor<2x1xi64, #blocked1> 2026-02-21T09:54:08.4842865Z %77 = arith.andi %75, %76 : tensor<2x1xi1, #blocked1> 2026-02-21T09:54:08.4843069Z %78 = tt.broadcast %77 : tensor<2x1xi1, #blocked1> -> tensor<2x128xi1, #blocked1> 2026-02-21T09:54:08.4843257Z %79 = arith.andi %78, %38 : tensor<2x128xi1, #blocked1> 2026-02-21T09:54:08.4843426Z %80 = tt.load %74, %79, %cst_9 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:54:08.4843681Z %81 = ttg.convert_layout %80 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:08.4843968Z %82 = arith.shli %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:08.4844205Z %83 = arith.shrsi %82, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:08.4844469Z %84 = arith.shrsi %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:08.4844761Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:54:08.4845100Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:54:08.4845388Z %87 = tt.broadcast %85 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:08.4845649Z %88 = arith.select %43, %87, %cst_10 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 
2026-02-21T09:54:08.4845891Z %89 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:08.4846124Z %90 = arith.select %45, %89, %88 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:54:08.4846350Z %91 = tt.reshape %90 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:54:08.4846575Z %92 = arith.sitofp %91 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:54:08.4846824Z %93 = ttg.local_alloc %92 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:54:08.4847157Z %94 = ttg.local_load %93 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:08.4847635Z %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:54:08.4847990Z scf.yield %95 : tensor<128x128xf32, #mma> 2026-02-21T09:54:08.4848119Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:08.4857705Z %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:54:08.4857997Z %48 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:08.4858233Z %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:54:08.4858458Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:54:08.4858715Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:08.4858916Z %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:54:08.4859093Z %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma> 2026-02-21T09:54:08.4859267Z %54 = tt.splat %arg2 : 
!tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:08.4859482Z %55 = tt.addptr %54, %53 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:54:08.4859679Z tt.store %55, %47 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:54:08.4859809Z tt.return 2026-02-21T09:54:08.4859890Z } 2026-02-21T09:54:08.4859963Z } 2026-02-21T09:54:08.4860008Z 2026-02-21T09:54:08.4860040Z {-# 2026-02-21T09:54:08.4860117Z external_resources: { 2026-02-21T09:54:08.4860216Z mlir_reproducer: { 2026-02-21T09:54:08.4861224Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:54:08.4862263Z disable_threading: false, 2026-02-21T09:54:08.4862367Z verify_each: true 2026-02-21T09:54:08.4862478Z } 2026-02-21T09:54:08.4862548Z } 2026-02-21T09:54:08.4862618Z #-} 2026-02-21T09:54:08.4862899Z /tmp/torchinductor_root/qy/cqyq56n6uafmgokjpwcvf26cpfbfufu46mxhrzhs4gjynwx5cdlg.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:54:08.4863602Z /tmp/torchinductor_root/qy/cqyq56n6uafmgokjpwcvf26cpfbfufu46mxhrzhs4gjynwx5cdlg.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:54:08.4864154Z [579s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:54:08.4864877Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:54:08.4865533Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:54:08.4865702Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:54:08.8706784Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 11.2 configs/s 2026-02-21T09:54:14.4341207Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 211/211 27.4 configs/s 2026-02-21T09:54:17.5181455Z [588s] Generation 8 complete: 2026-02-21T09:54:17.5181826Z error=10 2026-02-21T09:54:17.5182010Z ok=74 2026-02-21T09:54:17.5182181Z min=0.9360 2026-02-21T09:54:17.5182354Z mid=1.4499 2026-02-21T09:54:17.5182524Z max=34.4558 2026-02-21T09:54:17.5182722Z best={'block_sizes': [8, 128, 128], 2026-02-21T09:54:17.5183565Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:54:17.5183864Z 'l2_groupings': [2], 2026-02-21T09:54:17.5184116Z 'load_eviction_policies': ['', ''], 2026-02-21T09:54:17.5184379Z 'loop_orders': [[0, 1]], 2026-02-21T09:54:17.5184618Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:54:17.5184839Z 'num_stages': 1, 2026-02-21T09:54:17.5185034Z 'num_warps': 4, 2026-02-21T09:54:17.5185236Z 'pid_type': 'flat', 2026-02-21T09:54:17.5185465Z 'range_flattens': [None, None], 2026-02-21T09:54:17.5185728Z 'range_multi_buffers': [None, False], 2026-02-21T09:54:17.5185987Z 'range_num_stages': [0, 1], 2026-02-21T09:54:17.5186215Z 'range_unroll_factors': [0, 0], 2026-02-21T09:54:17.5186468Z 'range_warp_specializes': [], 2026-02-21T09:54:17.5186707Z 'waves_per_eu': 2} 
2026-02-21T09:54:17.5258585Z [588s] Fitting surrogate: 844 points, 844 targets 2026-02-21T09:54:18.2054323Z [589s] Generation 9 starting: 63 neighbors, 3 active search path(s) 2026-02-21T09:54:42.0969797Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 0.9 configs/s 2026-02-21T09:54:46.8506249Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:54:46.8510344Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:54:46.8513616Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:54:46.8515211Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:54:46.8516334Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:54:46.8517390Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:46.8517890Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:46.8518172Z #smem = #ttg.shared_memory 2026-02-21T09:54:46.8518757Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:54:46.8519518Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:54:46.8520158Z %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8520492Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:54:46.8520670Z %c3_i32 = 
arith.constant 3 : i32 2026-02-21T09:54:46.8520854Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T09:54:46.8521040Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:54:46.8521220Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:54:46.8521403Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:54:46.8521629Z %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8521861Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:54:46.8522036Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:54:46.8522204Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:54:46.8522424Z %cst_1 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8522819Z %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8523076Z %cst_3 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8523291Z %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:46.8523464Z %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:46.8523633Z %cst_6 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8523876Z %0 = tt.get_program_id x : i32 2026-02-21T09:54:46.8523988Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:54:46.8524107Z %2 = arith.minsi %1, %c16384_i32 : i32 2026-02-21T09:54:46.8524308Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8524582Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8524890Z %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8525194Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8525494Z 
%7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8525799Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8526038Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:54:46.8526282Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8526581Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:54:46.8527015Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:54:46.8527411Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:46.8527658Z %14 = arith.cmpi eq, %13, %cst_4 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:46.8527853Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:54:46.8528049Z %16 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:46.8528254Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked> 2026-02-21T09:54:46.8528459Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:46.8528615Z %19 = arith.subi %2, %0 : i32 2026-02-21T09:54:46.8528729Z %20 = arith.remsi %19, %c3_i32 : i32 2026-02-21T09:54:46.8528845Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:54:46.8528956Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:54:46.8529123Z scf.for %arg3 = %0 to %22 step %c3_i32 : i32 { 2026-02-21T09:54:46.8529263Z %23 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:54:46.8529384Z %24 = arith.muli 
%23, %c4_i32 : i32 2026-02-21T09:54:46.8529503Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:54:46.8529621Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:54:46.8529736Z %27 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:54:46.8529854Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:54:46.8529965Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:54:46.8530074Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:54:46.8530183Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:54:46.8530351Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8530568Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8530780Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8530990Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8531150Z %36 = arith.muli %30, %c64_i32 : i32 2026-02-21T09:54:46.8531376Z %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8531622Z %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8531865Z %39 = arith.addi %37, %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8532108Z %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8532375Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8532625Z %42 = arith.muli %41, %cst_1 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8532821Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8533168Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = 
#ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8533590Z %45 = tt.broadcast %44 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8533923Z %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>) : i32 { 2026-02-21T09:54:46.8534221Z %121 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8534538Z %122 = arith.addi %121, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8534753Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:54:46.8534924Z %124 = tt.splat %123 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8535143Z %125 = arith.addi %124, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8535419Z %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:54:46.8535716Z %127 = tt.broadcast %126 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8535909Z %128 = arith.addi %43, %127 : tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8536114Z %129 = tt.addptr %9, %128 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8536322Z %130 = tt.load %129 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:54:46.8536563Z %131 = ttg.local_alloc %130 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:54:46.8536898Z %132 = ttg.local_load %131 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8537307Z %133 = arith.extf %132 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8537767Z %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8538127Z %135 = arith.muli %134, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8538434Z %136 = tt.broadcast %135 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8538738Z %137 = arith.addi %136, %45 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8539047Z %138 = tt.addptr %10, %137 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8539379Z %139 = tt.load %138 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8539615Z %140 = arith.shli %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8539851Z %141 = arith.shrsi %140, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8540091Z %142 = arith.shrsi %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8540381Z %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8540722Z %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8541009Z %145 = tt.broadcast %143 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8541249Z %146 = arith.select %15, %145, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8541491Z 
%147 = tt.broadcast %144 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8541731Z %148 = arith.select %17, %147, %146 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8541963Z %149 = tt.reshape %148 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:54:46.8542189Z %150 = arith.sitofp %149 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:54:46.8542458Z %151 = ttg.local_alloc %150 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:54:46.8542787Z %152 = ttg.local_load %151 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8543272Z %153 = tt.dot %133, %152, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8543632Z scf.yield %153 : tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8543825Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:46.8544032Z %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:46.8544302Z %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8544550Z %49 = arith.muli %48, %cst_6 : tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8544793Z %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:46.8545049Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8545249Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8545429Z %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8545618Z %54 = 
tt.addptr %18, %53 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8545811Z tt.store %54, %47 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:46.8545955Z %55 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:54:46.8546078Z %56 = arith.divsi %55, %c512_i32 : i32 2026-02-21T09:54:46.8546203Z %57 = arith.muli %56, %c4_i32 : i32 2026-02-21T09:54:46.8546321Z %58 = arith.subi %c128_i32, %57 : i32 2026-02-21T09:54:46.8546443Z %59 = arith.minsi %58, %c4_i32 : i32 2026-02-21T09:54:46.8546563Z %60 = arith.remsi %55, %c512_i32 : i32 2026-02-21T09:54:46.8546686Z %61 = arith.remsi %60, %59 : i32 2026-02-21T09:54:46.8546804Z %62 = arith.addi %57, %61 : i32 2026-02-21T09:54:46.8546917Z %63 = arith.divsi %60, %59 : i32 2026-02-21T09:54:46.8547036Z %64 = arith.muli %62, %c128_i32 : i32 2026-02-21T09:54:46.8547235Z %65 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8547454Z %66 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8547669Z %67 = arith.addi %65, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8547910Z %68 = arith.addi %66, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8548075Z %69 = arith.muli %63, %c64_i32 : i32 2026-02-21T09:54:46.8548282Z %70 = tt.splat %69 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8548533Z %71 = tt.splat %69 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8548783Z %72 = arith.addi %70, %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8549032Z %73 = arith.addi %71, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8549300Z %74 = tt.expand_dims %67 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 
2026-02-21T09:54:46.8549556Z %75 = arith.muli %74, %cst_1 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8549748Z %76 = tt.broadcast %75 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8550119Z %77 = tt.expand_dims %72 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8550545Z %78 = tt.broadcast %77 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8550880Z %79 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>) : i32 { 2026-02-21T09:54:46.8551185Z %121 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8551504Z %122 = arith.addi %121, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8551725Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:54:46.8551903Z %124 = tt.splat %123 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8552128Z %125 = arith.addi %124, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8552422Z %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:54:46.8552701Z %127 = tt.broadcast %126 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8552904Z %128 = arith.addi %76, %127 : tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8553116Z %129 = tt.addptr %9, %128 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8553328Z %130 = tt.load %129 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:54:46.8553561Z %131 = ttg.local_alloc %130 : (tensor<128x4xbf16, 
#blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:54:46.8553895Z %132 = ttg.local_load %131 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8554312Z %133 = arith.extf %132 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8554777Z %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8555151Z %135 = arith.muli %134, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8555465Z %136 = tt.broadcast %135 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8555774Z %137 = arith.addi %136, %78 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8556083Z %138 = tt.addptr %10, %137 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8556401Z %139 = tt.load %138 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8556635Z %140 = arith.shli %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8556876Z %141 = arith.shrsi %140, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8557120Z %142 = arith.shrsi %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8557408Z %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8557748Z %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 
1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8558034Z %145 = tt.broadcast %143 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8558382Z %146 = arith.select %15, %145, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8558625Z %147 = tt.broadcast %144 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8558857Z %148 = arith.select %17, %147, %146 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8559095Z %149 = tt.reshape %148 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:54:46.8559326Z %150 = arith.sitofp %149 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:54:46.8559597Z %151 = ttg.local_alloc %150 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:54:46.8559927Z %152 = ttg.local_load %151 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8560401Z %153 = tt.dot %133, %152, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8560780Z scf.yield %153 : tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8560953Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:46.8561160Z %80 = arith.truncf %79 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:46.8561431Z %81 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8561670Z %82 = arith.muli %81, %cst_6 : tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8561901Z %83 = tt.expand_dims %73 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 
2026-02-21T09:54:46.8562157Z %84 = tt.broadcast %82 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8562356Z %85 = tt.broadcast %83 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8562535Z %86 = arith.addi %84, %85 : tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8562761Z %87 = tt.addptr %18, %86 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8562956Z tt.store %87, %80 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:46.8563097Z %88 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:54:46.8563241Z %89 = arith.divsi %88, %c512_i32 : i32 2026-02-21T09:54:46.8563364Z %90 = arith.muli %89, %c4_i32 : i32 2026-02-21T09:54:46.8563481Z %91 = arith.subi %c128_i32, %90 : i32 2026-02-21T09:54:46.8563603Z %92 = arith.minsi %91, %c4_i32 : i32 2026-02-21T09:54:46.8563722Z %93 = arith.remsi %88, %c512_i32 : i32 2026-02-21T09:54:46.8563844Z %94 = arith.remsi %93, %92 : i32 2026-02-21T09:54:46.8563959Z %95 = arith.addi %90, %94 : i32 2026-02-21T09:54:46.8564077Z %96 = arith.divsi %93, %92 : i32 2026-02-21T09:54:46.8564196Z %97 = arith.muli %95, %c128_i32 : i32 2026-02-21T09:54:46.8564364Z %98 = tt.splat %97 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8564581Z %99 = tt.splat %97 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8564797Z %100 = arith.addi %98, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8565015Z %101 = arith.addi %99, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8565182Z %102 = arith.muli %96, %c64_i32 : i32 2026-02-21T09:54:46.8565397Z %103 = tt.splat %102 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8565657Z %104 = tt.splat %102 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8565934Z %105 = arith.addi %103, %5 : tensor<64xi32, 
#ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8566190Z %106 = arith.addi %104, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8566466Z %107 = tt.expand_dims %100 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8566728Z %108 = arith.muli %107, %cst_1 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8566937Z %109 = tt.broadcast %108 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8567293Z %110 = tt.expand_dims %105 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8567744Z %111 = tt.broadcast %110 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8568088Z %112 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>) : i32 { 2026-02-21T09:54:46.8568407Z %121 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8568711Z %122 = arith.addi %121, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8568928Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:54:46.8569108Z %124 = tt.splat %123 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8569338Z %125 = arith.addi %124, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8569616Z %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:54:46.8569898Z %127 = tt.broadcast %126 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 
2026-02-21T09:54:46.8570100Z %128 = arith.addi %109, %127 : tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8570310Z %129 = tt.addptr %9, %128 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8570525Z %130 = tt.load %129 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:54:46.8570753Z %131 = ttg.local_alloc %130 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:54:46.8574086Z %132 = ttg.local_load %131 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8574508Z %133 = arith.extf %132 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8574976Z %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8575340Z %135 = arith.muli %134, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8575649Z %136 = tt.broadcast %135 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8575959Z %137 = arith.addi %136, %111 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8576277Z %138 = tt.addptr %10, %137 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8576589Z %139 = tt.load %138 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8576825Z %140 = arith.shli %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8577084Z %141 = arith.shrsi %140, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:54:46.8577327Z %142 = arith.shrsi %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8577619Z %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8577957Z %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8578243Z %145 = tt.broadcast %143 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8578499Z %146 = arith.select %15, %145, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8578740Z %147 = tt.broadcast %144 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8578978Z %148 = arith.select %17, %147, %146 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8579208Z %149 = tt.reshape %148 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:54:46.8579458Z %150 = arith.sitofp %149 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:54:46.8579709Z %151 = ttg.local_alloc %150 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:54:46.8580038Z %152 = ttg.local_load %151 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8580517Z %153 = tt.dot %133, %152, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8580873Z scf.yield %153 : tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8581040Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:46.8581250Z %113 = arith.truncf %112 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 
2026-02-21T09:54:46.8581517Z %114 = tt.expand_dims %101 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8581759Z %115 = arith.muli %114, %cst_6 : tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8582003Z %116 = tt.expand_dims %106 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:46.8582260Z %117 = tt.broadcast %115 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8582466Z %118 = tt.broadcast %116 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8582648Z %119 = arith.addi %117, %118 : tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8582840Z %120 = tt.addptr %18, %119 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8583036Z tt.store %120, %113 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:46.8583176Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:46.8583295Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:54:46.8583431Z %23 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:54:46.8583554Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:54:46.8583671Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:54:46.8583788Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:54:46.8583908Z %27 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:54:46.8584027Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:54:46.8584138Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:54:46.8584250Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:54:46.8584359Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:54:46.8584526Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8584756Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8584967Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:46.8585179Z %35 = arith.addi 
%33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:46.8585340Z %36 = arith.muli %30, %c64_i32 : i32 2026-02-21T09:54:46.8585544Z %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8585792Z %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8586051Z %39 = arith.addi %37, %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8586297Z %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:46.8586564Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8586817Z %42 = arith.muli %41, %cst_1 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:46.8587023Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8587370Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8587788Z %45 = tt.broadcast %44 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8588124Z %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>) : i32 { 2026-02-21T09:54:46.8588426Z %55 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8588725Z %56 = arith.addi %55, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:46.8588934Z %57 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:54:46.8589103Z %58 = tt.splat %57 : i32 -> tensor<4xi32, #ttg.slice<{dim 
= 0, parent = #blocked1}>> 2026-02-21T09:54:46.8589318Z %59 = arith.addi %58, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:46.8589604Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:54:46.8589878Z %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8590071Z %62 = arith.addi %43, %61 : tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8590267Z %63 = tt.addptr %9, %62 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:54:46.8590468Z %64 = tt.load %63 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:54:46.8590689Z %65 = ttg.local_alloc %64 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:54:46.8591018Z %66 = ttg.local_load %65 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8591422Z %67 = arith.extf %66 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8591876Z %68 = tt.expand_dims %56 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8592227Z %69 = arith.muli %68, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8592523Z %70 = tt.broadcast %69 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8592832Z %71 = arith.addi %70, %45 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8593131Z %72 = tt.addptr %10, %71 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:54:46.8593433Z %73 = tt.load %72 : tensor<2x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8593657Z %74 = arith.shli %73, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8593901Z %75 = arith.shrsi %74, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8594129Z %76 = arith.shrsi %73, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:46.8594406Z %77 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8594735Z %78 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked> 2026-02-21T09:54:46.8595024Z %79 = tt.broadcast %77 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8595254Z %80 = arith.select %15, %79, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8595484Z %81 = tt.broadcast %78 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8595705Z %82 = arith.select %17, %81, %80 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked> 2026-02-21T09:54:46.8595929Z %83 = tt.reshape %82 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2> 2026-02-21T09:54:46.8596145Z %84 = arith.sitofp %83 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2> 2026-02-21T09:54:46.8596390Z %85 = ttg.local_alloc %84 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem> 2026-02-21T09:54:46.8596713Z %86 = ttg.local_load %85 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:46.8597181Z %87 = tt.dot %67, %86, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent 
= #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8597543Z scf.yield %87 : tensor<128x64xf32, #mma> 2026-02-21T09:54:46.8597708Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:46.8597913Z %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:46.8598177Z %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8598412Z %49 = arith.muli %48, %cst_6 : tensor<128x1xi32, #mma> 2026-02-21T09:54:46.8598637Z %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:46.8598892Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8599087Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8599264Z %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8599447Z %54 = tt.addptr %18, %53 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:46.8599636Z tt.store %54, %47 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:46.8599774Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:46.8599878Z tt.return 2026-02-21T09:54:46.8599960Z } 2026-02-21T09:54:46.8600034Z } 2026-02-21T09:54:46.8600080Z 2026-02-21T09:54:46.8600110Z {-# 2026-02-21T09:54:46.8600204Z external_resources: { 2026-02-21T09:54:46.8600305Z mlir_reproducer: { 2026-02-21T09:54:46.8601305Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:54:46.8602313Z disable_threading: false, 2026-02-21T09:54:46.8602418Z verify_each: true 2026-02-21T09:54:46.8602508Z } 2026-02-21T09:54:46.8602614Z } 2026-02-21T09:54:46.8602685Z #-} 2026-02-21T09:54:46.8602956Z /tmp/torchinductor_root/js/cjs4wbyfs5ng5fgjz2j2ouuubwzm3iruptmd2fosj7bk57htgu72.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:54:46.8603650Z /tmp/torchinductor_root/js/cjs4wbyfs5ng5fgjz2j2ouuubwzm3iruptmd2fosj7bk57htgu72.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:54:46.8604194Z [617s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:54:46.8604959Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:54:46.8605662Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:54:46.8605830Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:54:47.1671617Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:54:47.1689795Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:54:47.1690237Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:54:47.1690648Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:54:47.1691022Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:54:47.1691368Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:54:47.1691676Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:47.1691921Z #smem = #ttg.shared_memory 2026-02-21T09:54:47.1692226Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:54:47.1692849Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:54:47.1693359Z %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1693567Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:54:47.1693721Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:54:47.1693867Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:54:47.1694048Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:54:47.1694243Z %cst_0 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1694436Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:54:47.1694594Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T09:54:47.1694747Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:54:47.1694896Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:54:47.1695049Z 
%c64_i32 = arith.constant 64 : i32 2026-02-21T09:54:47.1695297Z %cst_1 = arith.constant dense<0> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1695655Z %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1696029Z %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1696423Z %cst_4 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1696736Z %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1697046Z %cst_6 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1697445Z %cst_7 = arith.constant dense<508> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1697853Z %cst_8 = arith.constant dense<504> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1698218Z %cst_9 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1698449Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:54:47.1698599Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:54:47.1698790Z %cst_10 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1699060Z %cst_11 = arith.constant dense<4> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1699339Z %cst_12 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:47.1699553Z %cst_13 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:47.1699773Z %cst_14 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1699962Z %0 = tt.get_program_id x : i32 2026-02-21T09:54:47.1700105Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:54:47.1700248Z %2 = 
arith.minsi %1, %c16384_i32 : i32 2026-02-21T09:54:47.1700531Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1700882Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1701268Z %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1701654Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1701981Z %7 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1702288Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1702589Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1702982Z %10 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1703519Z %11 = arith.extsi %10 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1704155Z %12 = arith.extsi %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1704695Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:54:47.1705208Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, 
parent = #blocked}>> 2026-02-21T09:54:47.1705711Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:47.1706014Z %16 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:47.1706353Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked> 2026-02-21T09:54:47.1706552Z %18 = arith.cmpi eq, %15, %cst_13 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:47.1706744Z %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked> 2026-02-21T09:54:47.1706955Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:47.1707118Z %21 = arith.subi %2, %0 : i32 2026-02-21T09:54:47.1707250Z %22 = arith.remsi %21, %c4_i32 : i32 2026-02-21T09:54:47.1707368Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:54:47.1707480Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:54:47.1707607Z scf.for %arg3 = %0 to %24 step %c4_i32 : i32 { 2026-02-21T09:54:47.1707747Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:54:47.1707872Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:54:47.1707992Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:54:47.1708115Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:54:47.1708236Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:54:47.1708357Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:54:47.1708471Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:54:47.1708583Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:54:47.1708699Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:54:47.1708872Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1709093Z %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1709311Z %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1709546Z %37 = arith.addi %35, %4 : 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1709713Z %38 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:54:47.1709872Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1710077Z %40 = arith.addi %39, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1710352Z %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1710610Z %42 = arith.muli %41, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1710810Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1710985Z %44 = arith.extsi %38 : i32 to i64 2026-02-21T09:54:47.1711194Z %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1711496Z %46 = arith.addi %45, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1711889Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1712327Z %48 = tt.broadcast %47 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1712654Z %49 = arith.cmpi sge, %47, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1712899Z %50 = arith.cmpi slt, %47, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1713143Z %51 = arith.andi %49, %50 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1713443Z %52 = tt.broadcast %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:54:47.1713740Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1714030Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1714306Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1714502Z %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1714705Z %57 = tt.addptr %8, %56 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1714922Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1715205Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1723701Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1724012Z %60 = arith.addi %7, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1724295Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1724567Z %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1724762Z %63 = arith.addi %43, %62 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1724960Z %64 = tt.addptr %8, %63 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1725160Z %65 = tt.load %64 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1725446Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1725860Z ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 
2x128x8> 2026-02-21T09:54:47.1726381Z %67:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:54:47.1726801Z %393 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:54:47.1726925Z %394 = arith.muli %393, %c2_i32 : i32 2026-02-21T09:54:47.1727100Z %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1727333Z %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1727611Z %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1727894Z %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1728091Z %399 = arith.addi %43, %398 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1728297Z %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1728506Z %401 = tt.load %400 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1728827Z %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1729272Z %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1729556Z %404 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:54:47.1729769Z %405 = tt.splat %404 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1730066Z %406 = arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent 
= #blocked}>}>> 2026-02-21T09:54:47.1730476Z %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1730834Z %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1731159Z %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1731461Z %410 = arith.addi %409, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1731771Z %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1732086Z %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1732333Z %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1732569Z %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1732868Z %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1733166Z %416 = arith.andi %415, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1733411Z %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1733657Z %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1733909Z %419 = arith.shrsi %418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1734147Z %420 = arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:54:47.1734434Z %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1734770Z %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1735055Z %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1735348Z %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1735581Z %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1735814Z %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1736043Z %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1736264Z %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1736517Z %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1736853Z %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1737333Z %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1737687Z %432 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:47.1737816Z %433 = arith.cmpi slt, %432, %c2_i32 : i32 2026-02-21T09:54:47.1737952Z %434 = arith.select %433, %432, %c0_i32 : i32 2026-02-21T09:54:47.1738241Z %435 = ttg.memdesc_index %53[%434] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 
2026-02-21T09:54:47.1738603Z ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1739004Z scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1739355Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:47.1739672Z %68 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1740104Z %69 = arith.extf %68 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1740471Z %70 = arith.addi %11, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1740860Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1741207Z %72 = arith.muli %71, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1741505Z %73 = tt.broadcast %72 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1741798Z %74 = arith.addi %73, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1742116Z %75 = tt.addptr %9, %74 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1742424Z %76 = arith.cmpi sge, %71, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1742659Z %77 = arith.cmpi slt, %71, %cst_4 
: tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1742881Z %78 = arith.andi %76, %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1743171Z %79 = tt.broadcast %78 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1743459Z %80 = arith.andi %79, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1743696Z %81 = tt.load %75, %80, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1743936Z %82 = arith.shli %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1744161Z %83 = arith.shrsi %82, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1744394Z %84 = arith.shrsi %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1744671Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1745010Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1745283Z %87 = tt.broadcast %85 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1745511Z %88 = arith.select %17, %87, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1745739Z %89 = tt.broadcast %86 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1745963Z %90 = arith.select %19, %89, %88 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1746212Z %91 = tt.reshape %90 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1746426Z %92 = arith.sitofp %91 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1746665Z %93 = 
ttg.local_alloc %92 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1746981Z %94 = ttg.local_load %93 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1747457Z %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1747943Z %96 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1748369Z %97 = arith.extf %96 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1748737Z %98 = arith.addi %11, %cst_7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1749123Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1749476Z %100 = arith.muli %99, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1749780Z %101 = tt.broadcast %100 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1750100Z %102 = arith.addi %101, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1750404Z %103 = tt.addptr %9, %102 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1750716Z %104 = arith.cmpi sge, %99, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:54:47.1750957Z %105 = arith.cmpi slt, %99, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1751187Z %106 = arith.andi %104, %105 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1751485Z %107 = tt.broadcast %106 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1751782Z %108 = arith.andi %107, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1752024Z %109 = tt.load %103, %108, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1752270Z %110 = arith.shli %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1752503Z %111 = arith.shrsi %110, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1752737Z %112 = arith.shrsi %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1753036Z %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1753368Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1753649Z %115 = tt.broadcast %113 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1753888Z %116 = arith.select %17, %115, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1754121Z %117 = tt.broadcast %114 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1754375Z %118 = arith.select %19, %117, %116 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1754597Z %119 = tt.reshape %118 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1754817Z %120 = arith.sitofp %119 
: tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1755069Z %121 = ttg.local_alloc %120 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1755406Z %122 = ttg.local_load %121 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1755872Z %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1756252Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1756468Z %124 = arith.truncf %123 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:47.1756736Z %125 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1756976Z %126 = arith.muli %125, %cst_14 : tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1757209Z %127 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:47.1757465Z %128 = tt.broadcast %126 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1757671Z %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1757852Z %130 = arith.addi %128, %129 : tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1758057Z %131 = tt.addptr %20, %130 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1758257Z tt.store %131, %124 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:47.1758398Z %132 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:54:47.1758521Z %133 = arith.divsi %132, %c512_i32 : i32 2026-02-21T09:54:47.1758640Z %134 = arith.muli %133, %c4_i32 : i32 2026-02-21T09:54:47.1758761Z %135 = arith.subi %c128_i32, %134 : i32 
2026-02-21T09:54:47.1758881Z %136 = arith.minsi %135, %c4_i32 : i32 2026-02-21T09:54:47.1758999Z %137 = arith.remsi %132, %c512_i32 : i32 2026-02-21T09:54:47.1759120Z %138 = arith.remsi %137, %136 : i32 2026-02-21T09:54:47.1759235Z %139 = arith.addi %134, %138 : i32 2026-02-21T09:54:47.1759352Z %140 = arith.divsi %137, %136 : i32 2026-02-21T09:54:47.1759467Z %141 = arith.muli %139, %c128_i32 : i32 2026-02-21T09:54:47.1759643Z %142 = tt.splat %141 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1759865Z %143 = tt.splat %141 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1760086Z %144 = arith.addi %142, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1760304Z %145 = arith.addi %143, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1760486Z %146 = arith.muli %140, %c64_i32 : i32 2026-02-21T09:54:47.1760645Z %147 = tt.splat %146 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1760850Z %148 = arith.addi %147, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1761125Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1761382Z %150 = arith.muli %149, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1761582Z %151 = tt.broadcast %150 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1761760Z %152 = arith.extsi %146 : i32 to i64 2026-02-21T09:54:47.1761984Z %153 = tt.splat %152 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1762281Z %154 = arith.addi %153, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1762728Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, 
parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1763172Z %156 = tt.broadcast %155 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1763484Z %157 = arith.cmpi sge, %155, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1763728Z %158 = arith.cmpi slt, %155, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1763967Z %159 = arith.andi %157, %158 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1764269Z %160 = tt.broadcast %159 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1764565Z %161 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1764756Z %162 = arith.addi %151, %55 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1764959Z %163 = tt.addptr %8, %162 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1765165Z %164 = tt.load %163 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1765477Z %165 = ttg.memdesc_index %161[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1765837Z ttg.local_store %164, %165 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1766078Z %166 = arith.addi %151, %62 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1766279Z %167 = tt.addptr %8, %166 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1766481Z %168 = tt.load %167 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1766763Z %169 = ttg.memdesc_index %161[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> 
!ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1767119Z ttg.local_store %168, %169 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1767641Z %170:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %165, %arg8 = %169) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:54:47.1768060Z %393 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:54:47.1768184Z %394 = arith.muli %393, %c2_i32 : i32 2026-02-21T09:54:47.1768357Z %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1768600Z %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1768877Z %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1769157Z %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1769353Z %399 = arith.addi %151, %398 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1769556Z %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1769768Z %401 = tt.load %400 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1770092Z %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1770528Z %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1770813Z %404 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:54:47.1771039Z %405 = tt.splat %404 : i64 -> tensor<4xi64, 
#ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1771335Z %406 = arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1771724Z %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1772077Z %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1772384Z %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1772689Z %410 = arith.addi %409, %156 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1772997Z %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1773311Z %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1773573Z %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1773809Z %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1774108Z %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1774402Z %416 = arith.andi %415, %160 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1774648Z %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1774894Z %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1775127Z %419 = arith.shrsi 
%418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1775361Z %420 = arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1775646Z %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1775984Z %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1776265Z %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1776519Z %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1776756Z %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1776985Z %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1777214Z %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1777437Z %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1777687Z %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1778027Z %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1778497Z %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1778847Z %432 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:47.1779089Z %433 = arith.cmpi slt, %432, %c2_i32 : i32 2026-02-21T09:54:47.1779221Z %434 = arith.select %433, %432, %c0_i32 
: i32 2026-02-21T09:54:47.1779490Z %435 = ttg.memdesc_index %161[%434] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1779849Z ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1780251Z scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1780592Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:47.1780907Z %171 = ttg.local_load %170#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1781343Z %172 = arith.extf %171 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1781697Z %173 = arith.addi %73, %156 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1782001Z %174 = tt.addptr %9, %173 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1782302Z %175 = arith.andi %79, %160 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1782540Z %176 = tt.load %174, %175, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1782786Z %177 = arith.shli %176, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1783022Z %178 = arith.shrsi %177, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1783255Z %179 = arith.shrsi %176, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1783543Z %180 = 
tt.expand_dims %178 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1783873Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1784152Z %182 = tt.broadcast %180 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1784391Z %183 = arith.select %17, %182, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1784645Z %184 = tt.broadcast %181 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1784875Z %185 = arith.select %19, %184, %183 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1785101Z %186 = tt.reshape %185 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1785319Z %187 = arith.sitofp %186 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1785567Z %188 = ttg.local_alloc %187 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1785883Z %189 = ttg.local_load %188 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1786369Z %190 = tt.dot %172, %189, %170#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1786865Z %191 = ttg.local_load %170#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1787304Z %192 = arith.extf %191 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1787637Z %193 = arith.addi 
%101, %156 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1787945Z %194 = tt.addptr %9, %193 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1788247Z %195 = arith.andi %107, %160 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1788487Z %196 = tt.load %194, %195, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1788732Z %197 = arith.shli %196, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1788966Z %198 = arith.shrsi %197, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1789200Z %199 = arith.shrsi %196, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1789482Z %200 = tt.expand_dims %198 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1789827Z %201 = tt.expand_dims %199 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1790102Z %202 = tt.broadcast %200 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1790337Z %203 = arith.select %17, %202, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1790571Z %204 = tt.broadcast %201 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1790799Z %205 = arith.select %19, %204, %203 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1791025Z %206 = tt.reshape %205 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1791242Z %207 = arith.sitofp %206 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1791490Z %208 = ttg.local_alloc %207 : (tensor<8x64xf32, #blocked2>) -> 
!ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1791810Z %209 = ttg.local_load %208 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1792276Z %210 = tt.dot %192, %209, %190, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1792680Z ttg.local_dealloc %161 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1792893Z %211 = arith.truncf %210 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:47.1793164Z %212 = tt.expand_dims %145 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1793406Z %213 = arith.muli %212, %cst_14 : tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1793634Z %214 = tt.expand_dims %148 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:47.1793890Z %215 = tt.broadcast %213 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1794143Z %216 = tt.broadcast %214 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1794320Z %217 = arith.addi %215, %216 : tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1794509Z %218 = tt.addptr %20, %217 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1794704Z tt.store %218, %211 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:47.1794845Z %219 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:54:47.1794977Z %220 = arith.divsi %219, %c512_i32 : i32 2026-02-21T09:54:47.1795097Z %221 = arith.muli %220, %c4_i32 : i32 2026-02-21T09:54:47.1795216Z %222 = arith.subi %c128_i32, %221 : i32 2026-02-21T09:54:47.1795333Z %223 = arith.minsi %222, %c4_i32 : i32 2026-02-21T09:54:47.1795450Z %224 = arith.remsi %219, %c512_i32 : i32 
2026-02-21T09:54:47.1795565Z %225 = arith.remsi %224, %223 : i32 2026-02-21T09:54:47.1795680Z %226 = arith.addi %221, %225 : i32 2026-02-21T09:54:47.1795794Z %227 = arith.divsi %224, %223 : i32 2026-02-21T09:54:47.1795912Z %228 = arith.muli %226, %c128_i32 : i32 2026-02-21T09:54:47.1796082Z %229 = tt.splat %228 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1796303Z %230 = tt.splat %228 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1796518Z %231 = arith.addi %229, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1796730Z %232 = arith.addi %230, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1796895Z %233 = arith.muli %227, %c64_i32 : i32 2026-02-21T09:54:47.1797051Z %234 = tt.splat %233 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1797268Z %235 = arith.addi %234, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1797541Z %236 = tt.expand_dims %231 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1797794Z %237 = arith.muli %236, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1797991Z %238 = tt.broadcast %237 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1798166Z %239 = arith.extsi %233 : i32 to i64 2026-02-21T09:54:47.1798370Z %240 = tt.splat %239 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1798669Z %241 = arith.addi %240, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1799056Z %242 = tt.expand_dims %241 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:54:47.1799481Z %243 = tt.broadcast %242 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1799789Z %244 = arith.cmpi sge, %242, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1800029Z %245 = arith.cmpi slt, %242, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1800276Z %246 = arith.andi %244, %245 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1800573Z %247 = tt.broadcast %246 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1800867Z %248 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1801053Z %249 = arith.addi %238, %55 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1801255Z %250 = tt.addptr %8, %249 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1801477Z %251 = tt.load %250 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1801758Z %252 = ttg.memdesc_index %248[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1802116Z ttg.local_store %251, %252 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1802355Z %253 = arith.addi %238, %62 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1802602Z %254 = tt.addptr %8, %253 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1802807Z %255 = tt.load %254 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1803083Z %256 = ttg.memdesc_index %248[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1803437Z ttg.local_store %255, %256 : 
tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1803954Z %257:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %252, %arg8 = %256) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:54:47.1804371Z %393 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:54:47.1804493Z %394 = arith.muli %393, %c2_i32 : i32 2026-02-21T09:54:47.1804662Z %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1804882Z %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1805176Z %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1805450Z %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1805646Z %399 = arith.addi %238, %398 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1805845Z %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1806051Z %401 = tt.load %400 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1806352Z %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1806787Z %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1807066Z %404 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:54:47.1807273Z %405 = tt.splat %404 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1807568Z %406 = 
arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1807952Z %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1808317Z %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1808621Z %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1808923Z %410 = arith.addi %409, %243 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1809230Z %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1809561Z %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1809800Z %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1810033Z %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1810331Z %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1810640Z %416 = arith.andi %415, %247 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1810883Z %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1811126Z %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1811360Z %419 = arith.shrsi %418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1811598Z %420 = 
arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1811881Z %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1812214Z %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1812493Z %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1812727Z %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1812979Z %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1813207Z %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1813433Z %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1813651Z %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1813902Z %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1814226Z %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1814695Z %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1815042Z %432 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:47.1815168Z %433 = arith.cmpi slt, %432, %c2_i32 : i32 2026-02-21T09:54:47.1815299Z %434 = arith.select %433, %432, %c0_i32 : i32 2026-02-21T09:54:47.1815566Z %435 = ttg.memdesc_index %248[%434] : !ttg.memdesc<2x128x8xbf16, #shared, 
#smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1815922Z ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1816334Z scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1816669Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:47.1816982Z %258 = ttg.local_load %257#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1817410Z %259 = arith.extf %258 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1817753Z %260 = arith.addi %73, %243 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1818054Z %261 = tt.addptr %9, %260 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1818357Z %262 = arith.andi %79, %247 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1818608Z %263 = tt.load %261, %262, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1818850Z %264 = arith.shli %263, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1819082Z %265 = arith.shrsi %264, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1819313Z %266 = arith.shrsi %263, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1819598Z %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1819924Z %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1820201Z %269 = tt.broadcast %267 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1820438Z %270 = arith.select %17, %269, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1820667Z %271 = tt.broadcast %268 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1820911Z %272 = arith.select %19, %271, %270 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1821134Z %273 = tt.reshape %272 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1821352Z %274 = arith.sitofp %273 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1821599Z %275 = ttg.local_alloc %274 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1821913Z %276 = ttg.local_load %275 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1822386Z %277 = tt.dot %259, %276, %257#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1822883Z %278 = ttg.local_load %257#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1823311Z %279 = arith.extf %278 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1823645Z %280 = arith.addi %101, %243 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:54:47.1823950Z %281 = tt.addptr %9, %280 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1824266Z %282 = arith.andi %107, %247 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1824508Z %283 = tt.load %281, %282, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1824755Z %284 = arith.shli %283, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1824989Z %285 = arith.shrsi %284, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1825222Z %286 = arith.shrsi %283, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1825523Z %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1825853Z %288 = tt.expand_dims %286 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1826128Z %289 = tt.broadcast %287 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1826375Z %290 = arith.select %17, %289, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1826608Z %291 = tt.broadcast %288 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1826835Z %292 = arith.select %19, %291, %290 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1827061Z %293 = tt.reshape %292 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1827278Z %294 = arith.sitofp %293 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1827529Z %295 = ttg.local_alloc %294 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1827849Z %296 = 
ttg.local_load %295 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1828313Z %297 = tt.dot %279, %296, %277, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1828699Z ttg.local_dealloc %248 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1828912Z %298 = arith.truncf %297 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:47.1829194Z %299 = tt.expand_dims %232 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1829437Z %300 = arith.muli %299, %cst_14 : tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1829668Z %301 = tt.expand_dims %235 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:47.1829925Z %302 = tt.broadcast %300 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1830128Z %303 = tt.broadcast %301 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1830310Z %304 = arith.addi %302, %303 : tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1830501Z %305 = tt.addptr %20, %304 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1830696Z tt.store %305, %298 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:47.1830838Z %306 = arith.addi %arg3, %c3_i32 : i32 2026-02-21T09:54:47.1830961Z %307 = arith.divsi %306, %c512_i32 : i32 2026-02-21T09:54:47.1831081Z %308 = arith.muli %307, %c4_i32 : i32 2026-02-21T09:54:47.1831199Z %309 = arith.subi %c128_i32, %308 : i32 2026-02-21T09:54:47.1831318Z %310 = arith.minsi %309, %c4_i32 : i32 2026-02-21T09:54:47.1831436Z %311 = arith.remsi %306, %c512_i32 : i32 2026-02-21T09:54:47.1831550Z %312 = arith.remsi %311, %310 : i32 
2026-02-21T09:54:47.1831679Z %313 = arith.addi %308, %312 : i32 2026-02-21T09:54:47.1831793Z %314 = arith.divsi %311, %310 : i32 2026-02-21T09:54:47.1831910Z %315 = arith.muli %313, %c128_i32 : i32 2026-02-21T09:54:47.1832080Z %316 = tt.splat %315 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1832299Z %317 = tt.splat %315 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1832517Z %318 = arith.addi %316, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1832733Z %319 = arith.addi %317, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1832920Z %320 = arith.muli %314, %c64_i32 : i32 2026-02-21T09:54:47.1833080Z %321 = tt.splat %320 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1833291Z %322 = arith.addi %321, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1833567Z %323 = tt.expand_dims %318 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1833830Z %324 = arith.muli %323, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1834048Z %325 = tt.broadcast %324 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1834226Z %326 = arith.extsi %320 : i32 to i64 2026-02-21T09:54:47.1834438Z %327 = tt.splat %326 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1834736Z %328 = arith.addi %327, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1835133Z %329 = tt.expand_dims %328 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1835562Z %330 = tt.broadcast %329 : tensor<1x64xi64, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1835876Z %331 = arith.cmpi sge, %329, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1836125Z %332 = arith.cmpi slt, %329, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1836364Z %333 = arith.andi %331, %332 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1836682Z %334 = tt.broadcast %333 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1836983Z %335 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1837171Z %336 = arith.addi %325, %55 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1837377Z %337 = tt.addptr %8, %336 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1837591Z %338 = tt.load %337 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1837879Z %339 = ttg.memdesc_index %335[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1838242Z ttg.local_store %338, %339 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1838486Z %340 = arith.addi %325, %62 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1838688Z %341 = tt.addptr %8, %340 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1838897Z %342 = tt.load %341 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1839180Z %343 = ttg.memdesc_index %335[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1839557Z ttg.local_store %342, %343 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, 
mutable, 2x128x8> 2026-02-21T09:54:47.1840085Z %344:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %339, %arg8 = %343) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:54:47.1840509Z %393 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:54:47.1840641Z %394 = arith.muli %393, %c2_i32 : i32 2026-02-21T09:54:47.1840813Z %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1841059Z %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1841337Z %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1841622Z %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1841827Z %399 = arith.addi %325, %398 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1842053Z %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1842263Z %401 = tt.load %400 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1842613Z %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1843056Z %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1843347Z %404 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:54:47.1843559Z %405 = tt.splat %404 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1843866Z %406 = arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim 
= 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1844265Z %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1844624Z %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1844957Z %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1845263Z %410 = arith.addi %409, %330 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1845575Z %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1845894Z %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1846140Z %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1846380Z %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1846682Z %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1846980Z %416 = arith.andi %415, %334 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1847233Z %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1847483Z %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1847741Z %419 = arith.shrsi %418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1847982Z %420 = arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:54:47.1848273Z %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1848611Z %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1848894Z %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1849157Z %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1849396Z %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1849630Z %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1849864Z %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1850104Z %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1850363Z %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1850693Z %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1851164Z %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1851524Z %432 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:47.1851654Z %433 = arith.cmpi slt, %432, %c2_i32 : i32 2026-02-21T09:54:47.1851794Z %434 = arith.select %433, %432, %c0_i32 : i32 2026-02-21T09:54:47.1852071Z %435 = ttg.memdesc_index %335[%434] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 
2026-02-21T09:54:47.1852433Z ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1852852Z scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1853195Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:47.1853514Z %345 = ttg.local_load %344#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1853955Z %346 = arith.extf %345 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1854288Z %347 = arith.addi %73, %330 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1854599Z %348 = tt.addptr %9, %347 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1854906Z %349 = arith.andi %79, %334 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1855147Z %350 = tt.load %348, %349, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1855397Z %351 = arith.shli %350, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1855635Z %352 = arith.shrsi %351, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1855889Z %353 = arith.shrsi %350, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1856179Z %354 = tt.expand_dims %352 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1856512Z %355 = tt.expand_dims %353 {axis = 1 
: i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1856797Z %356 = tt.broadcast %354 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1857037Z %357 = arith.select %17, %356, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1857287Z %358 = tt.broadcast %355 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1857523Z %359 = arith.select %19, %358, %357 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1857751Z %360 = tt.reshape %359 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1857974Z %361 = arith.sitofp %360 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1858241Z %362 = ttg.local_alloc %361 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1858562Z %363 = ttg.local_load %362 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1859033Z %364 = tt.dot %346, %363, %344#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1859535Z %365 = ttg.local_load %344#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1859969Z %366 = arith.extf %365 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1860303Z %367 = arith.addi %101, %330 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1860612Z %368 = tt.addptr %9, %367 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, 
tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1860919Z %369 = arith.andi %107, %334 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1861185Z %370 = tt.load %368, %369, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1861431Z %371 = arith.shli %370, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1861670Z %372 = arith.shrsi %371, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1861906Z %373 = arith.shrsi %370, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1862197Z %374 = tt.expand_dims %372 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1862533Z %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1862812Z %376 = tt.broadcast %374 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1863054Z %377 = arith.select %17, %376, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1863288Z %378 = tt.broadcast %375 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1863524Z %379 = arith.select %19, %378, %377 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1863754Z %380 = tt.reshape %379 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1863988Z %381 = arith.sitofp %380 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1864243Z %382 = ttg.local_alloc %381 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1864564Z %383 = ttg.local_load %382 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth 
= 2}>> 2026-02-21T09:54:47.1865034Z %384 = tt.dot %366, %383, %364, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1865443Z ttg.local_dealloc %335 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1865657Z %385 = arith.truncf %384 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:47.1865931Z %386 = tt.expand_dims %319 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1866178Z %387 = arith.muli %386, %cst_14 : tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1866429Z %388 = tt.expand_dims %322 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:47.1866692Z %389 = tt.broadcast %387 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1866898Z %390 = tt.broadcast %388 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1867084Z %391 = arith.addi %389, %390 : tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1867275Z %392 = tt.addptr %20, %391 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1867481Z tt.store %392, %385 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:47.1867626Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:47.1867750Z scf.for %arg3 = %24 to %2 step %c1_i32 : i32 { 2026-02-21T09:54:47.1867893Z %25 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T09:54:47.1868019Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:54:47.1868143Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:54:47.1868263Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:54:47.1868388Z %29 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T09:54:47.1868514Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:54:47.1868629Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:54:47.1868746Z %32 = 
arith.divsi %29, %28 : i32 2026-02-21T09:54:47.1868879Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:54:47.1869051Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1869266Z %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1869483Z %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:47.1869700Z %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:47.1869867Z %38 = arith.muli %32, %c64_i32 : i32 2026-02-21T09:54:47.1870030Z %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1870233Z %40 = arith.addi %39, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:47.1870510Z %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1870768Z %42 = arith.muli %41, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:47.1870962Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1871142Z %44 = arith.extsi %38 : i32 to i64 2026-02-21T09:54:47.1871346Z %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1871659Z %46 = arith.addi %45, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1872043Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1872467Z %48 = tt.broadcast %47 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1872780Z %49 = 
arith.cmpi sge, %47, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1873019Z %50 = arith.cmpi slt, %47, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1873274Z %51 = arith.andi %49, %50 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1873571Z %52 = tt.broadcast %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1873858Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1874141Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1874409Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1874599Z %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1874796Z %57 = tt.addptr %8, %56 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1874995Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1875277Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1875632Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1875904Z %60 = arith.addi %7, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1876178Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1876445Z %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1876650Z %63 = arith.addi %43, %62 : tensor<128x8xi32, #blocked1> 
2026-02-21T09:54:47.1876841Z %64 = tt.addptr %8, %63 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1877041Z %65 = tt.load %64 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1877318Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1877670Z ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1878185Z %67:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:54:47.1878600Z %132 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:54:47.1878645Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:54:47.1878739Z %134 = tt.splat %133 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1878833Z %135 = arith.addi %134, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:47.1878980Z %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:47.1879085Z %137 = tt.broadcast %136 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1879148Z %138 = arith.addi %43, %137 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1879255Z %139 = tt.addptr %8, %138 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:47.1879319Z %140 = tt.load %139 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:47.1879522Z %141 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:54:47.1879724Z %142 = arith.extf %141 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1879787Z %143 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:54:47.1879918Z %144 = tt.splat %143 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1880049Z %145 = arith.addi %144, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1880280Z %146 = tt.expand_dims %145 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1880379Z %147 = arith.muli %146, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1880551Z %148 = tt.broadcast %147 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1880646Z %149 = arith.addi %148, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1880821Z %150 = tt.addptr %9, %149 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1880926Z %151 = arith.cmpi sge, %146, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881027Z %152 = arith.cmpi slt, %146, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881122Z %153 = arith.andi %151, %152 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881300Z %154 = tt.broadcast %153 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881392Z %155 = arith.andi %154, %52 : tensor<4x64xi1, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881502Z %156 = tt.load %150, %155, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881601Z %157 = arith.shli %156, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881698Z %158 = arith.shrsi %157, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881795Z %159 = arith.shrsi %156, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1881945Z %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1882092Z %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1882185Z %162 = tt.broadcast %160 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1882292Z %163 = arith.select %17, %162, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1882381Z %164 = tt.broadcast %161 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1882490Z %165 = arith.select %19, %164, %163 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1882622Z %166 = tt.reshape %165 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1882714Z %167 = arith.sitofp %166 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1882831Z %168 = ttg.local_alloc %167 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1883002Z %169 = ttg.local_load %168 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1883268Z %170 = tt.dot %142, %169, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1883334Z %171 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:54:47.1883385Z %172 = arith.cmpi slt, %171, %c2_i32 : i32 2026-02-21T09:54:47.1883436Z %173 = arith.select %172, %171, %c0_i32 : i32 2026-02-21T09:54:47.1883631Z %174 = ttg.memdesc_index %53[%173] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1883776Z ttg.local_store %140, %174 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1883993Z scf.yield %170, %173, %arg8, %174 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:54:47.1884074Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:47.1884271Z %68 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1884467Z %69 = arith.extf %68 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1884600Z %70 = arith.addi %11, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1884818Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1884927Z %72 = arith.muli %71, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885096Z %73 = tt.broadcast %72 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885183Z %74 = arith.addi %73, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885352Z %75 = tt.addptr %9, %74 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885453Z %76 = arith.cmpi sge, %71, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885548Z %77 = arith.cmpi slt, %71, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885633Z %78 = arith.andi %76, %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885798Z %79 = tt.broadcast %78 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885886Z %80 = arith.andi %79, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1885989Z %81 = tt.load %75, %80, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1886085Z %82 = arith.shli %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1886194Z %83 = arith.shrsi %82, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1886289Z %84 = arith.shrsi %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1886431Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1886576Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1886666Z %87 = tt.broadcast %85 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1886778Z %88 = arith.select %17, %87, %cst_0 : 
tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1886868Z %89 = tt.broadcast %86 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1886962Z %90 = arith.select %19, %89, %88 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1887045Z %91 = tt.reshape %90 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1887147Z %92 = arith.sitofp %91 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1887259Z %93 = ttg.local_alloc %92 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1887422Z %94 = ttg.local_load %93 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1887683Z %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1887876Z %96 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1888072Z %97 = arith.extf %96 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1888202Z %98 = arith.addi %11, %cst_7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:47.1888426Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1888525Z %100 = arith.muli %99, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1888691Z %101 = tt.broadcast %100 
: tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1888785Z %102 = arith.addi %101, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1888964Z %103 = tt.addptr %9, %102 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889062Z %104 = arith.cmpi sge, %99, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889159Z %105 = arith.cmpi slt, %99, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889253Z %106 = arith.andi %104, %105 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889415Z %107 = tt.broadcast %106 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889506Z %108 = arith.andi %107, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889635Z %109 = tt.load %103, %108, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889732Z %110 = arith.shli %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889830Z %111 = arith.shrsi %110, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1889928Z %112 = arith.shrsi %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:47.1890076Z %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1890220Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:47.1890438Z %115 = tt.broadcast %113 : tensor<4x1x64xi8, #blocked> -> 
tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1890539Z %116 = arith.select %17, %115, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1890631Z %117 = tt.broadcast %114 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1890741Z %118 = arith.select %19, %117, %116 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:47.1890827Z %119 = tt.reshape %118 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:47.1890915Z %120 = arith.sitofp %119 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:47.1891030Z %121 = ttg.local_alloc %120 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:47.1891197Z %122 = ttg.local_load %121 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:47.1891452Z %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:47.1891541Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:54:47.1891628Z %124 = arith.truncf %123 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:47.1891768Z %125 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1891830Z %126 = arith.muli %125, %cst_14 : tensor<128x1xi32, #mma> 2026-02-21T09:54:47.1891977Z %127 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:47.1892060Z %128 = tt.broadcast %126 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1892141Z %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, 
#mma> 2026-02-21T09:54:47.1892202Z %130 = arith.addi %128, %129 : tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1892298Z %131 = tt.addptr %20, %130 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:47.1892360Z tt.store %131, %124 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:47.1892405Z } {tt.num_stages = 1 : i32} 2026-02-21T09:54:47.1892440Z tt.return 2026-02-21T09:54:47.1892471Z } 2026-02-21T09:54:47.1892509Z } 2026-02-21T09:54:47.1892514Z 2026-02-21T09:54:47.1892544Z {-# 2026-02-21T09:54:47.1892585Z external_resources: { 2026-02-21T09:54:47.1892623Z mlir_reproducer: { 2026-02-21T09:54:47.1893558Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:54:47.1893616Z disable_threading: false, 2026-02-21T09:54:47.1893654Z verify_each: true 2026-02-21T09:54:47.1893686Z } 2026-02-21T09:54:47.1893716Z } 2026-02-21T09:54:47.1893746Z #-} 2026-02-21T09:54:47.1893987Z /tmp/torchinductor_root/lv/clvm7l6fs2dqjoun2ulv7bbmeucmix4j7eppuxy3r2di4tghe3tn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:54:47.1894398Z /tmp/torchinductor_root/lv/clvm7l6fs2dqjoun2ulv7bbmeucmix4j7eppuxy3r2di4tghe3tn.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:54:47.1894529Z [618s] Triton compile 
failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:54:47.1895162Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:54:47.1895219Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:54:47.1895304Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:54:48.2720478Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:54:48.2722878Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}> 2026-02-21T09:54:48.2723536Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:54:48.2724150Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}> 2026-02-21T09:54:48.2724716Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:54:48.2725487Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:54:48.2725924Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:54:48.2726267Z #smem = #ttg.shared_memory 2026-02-21T09:54:48.2726710Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", 
"ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:54:48.2727594Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:54:48.2728315Z %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma> 2026-02-21T09:54:48.2728617Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:54:48.2728827Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:54:48.2729042Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:54:48.2729352Z %cst_0 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:48.2729619Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:54:48.2729831Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:54:48.2730050Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:54:48.2730384Z %cst_1 = arith.constant dense<0> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2730864Z %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2731438Z %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2731908Z %cst_4 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2732369Z %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2732830Z %cst_6 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2733249Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:48.2733639Z %cst_8 = arith.constant dense<4> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2734122Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:48.2734442Z %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 
2026-02-21T09:54:48.2734768Z %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:54:48.2734978Z %0 = tt.get_program_id x : i32 2026-02-21T09:54:48.2735131Z %1 = arith.divsi %0, %c512_i32 : i32 2026-02-21T09:54:48.2735289Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:54:48.2735523Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:54:48.2735681Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:54:48.2735835Z %5 = arith.remsi %0, %c512_i32 : i32 2026-02-21T09:54:48.2735988Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:54:48.2736135Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:54:48.2736279Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:54:48.2736424Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:54:48.2736709Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:48.2737095Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:48.2737440Z %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:48.2737736Z %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:48.2738030Z %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:54:48.2738317Z %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:54:48.2738550Z %16 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:54:48.2738897Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2739325Z %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:48.2739640Z %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:48.2739914Z %20 = arith.addi %19, %18 : 
tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:54:48.2740246Z %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:48.2740667Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:54:48.2741022Z %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1> 2026-02-21T09:54:48.2741292Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:48.2741591Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:48.2741821Z %26 = arith.extsi %16 : i32 to i64 2026-02-21T09:54:48.2742082Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2742501Z %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2743116Z %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2743664Z %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2744209Z %31 = arith.extsi %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2744736Z %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2745176Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:54:48.2745636Z %34 = tt.broadcast %33 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2745992Z %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2746253Z %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2746501Z %37 = arith.andi %35, %36 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2746817Z %38 = tt.broadcast %37 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2747208Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:54:48.2747674Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:54:48.2748105Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:48.2748382Z %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:48.2748590Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked> 2026-02-21T09:54:48.2748804Z %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:54:48.2749029Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked> 2026-02-21T09:54:48.2749315Z %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst) -> (tensor<128x64xf32, #mma>) : i32 { 2026-02-21T09:54:48.2749546Z %56 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:54:48.2749729Z %57 = tt.splat %56 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, 
parent = #blocked1}>> 2026-02-21T09:54:48.2749966Z %58 = arith.addi %57, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:54:48.2750257Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:54:48.2750549Z %60 = tt.broadcast %59 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:54:48.2750758Z %61 = arith.addi %24, %60 : tensor<128x8xi32, #blocked1> 2026-02-21T09:54:48.2750973Z %62 = tt.addptr %25, %61 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:54:48.2751200Z %63 = tt.load %62 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:54:48.2751437Z %64 = ttg.local_alloc %63 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:54:48.2751793Z %65 = ttg.local_load %64 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:48.2752250Z %66 = arith.extf %65 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:48.2752555Z %67 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:54:48.2760303Z %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2760621Z %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> 2026-02-21T09:54:48.2761001Z %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2761391Z %71 = arith.muli %70, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2761691Z %72 = tt.broadcast %71 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2761986Z %73 = arith.addi %72, %34 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2762309Z %74 = tt.addptr %27, %73 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2762698Z %75 = arith.cmpi sge, %70, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2762940Z %76 = arith.cmpi slt, %70, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2763170Z %77 = arith.andi %75, %76 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2763457Z %78 = tt.broadcast %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2763744Z %79 = arith.andi %78, %38 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2763978Z %80 = tt.load %74, %79, %cst_1 : tensor<4x64x!tt.ptr, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2764213Z %81 = arith.shli %80, %cst_8 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2764441Z %82 = arith.shrsi %81, %cst_8 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2764695Z %83 = arith.shrsi %80, %cst_8 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:54:48.2764975Z %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:48.2765306Z %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked> 2026-02-21T09:54:48.2765578Z %86 = tt.broadcast %84 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:48.2765807Z %87 = arith.select %43, %86, %cst_0 
: tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:48.2766034Z %88 = tt.broadcast %85 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:48.2766253Z %89 = arith.select %45, %88, %87 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked> 2026-02-21T09:54:48.2766472Z %90 = tt.reshape %89 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2> 2026-02-21T09:54:48.2766684Z %91 = arith.sitofp %90 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2> 2026-02-21T09:54:48.2766925Z %92 = ttg.local_alloc %91 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem> 2026-02-21T09:54:48.2767241Z %93 = ttg.local_load %92 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:54:48.2767708Z %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma> 2026-02-21T09:54:48.2768073Z scf.yield %94 : tensor<128x64xf32, #mma> 2026-02-21T09:54:48.2768238Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:54:48.2768441Z %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma> 2026-02-21T09:54:48.2768705Z %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:54:48.2768937Z %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma> 2026-02-21T09:54:48.2769185Z %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma> 2026-02-21T09:54:48.2769435Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:48.2769635Z %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma> 2026-02-21T09:54:48.2769808Z %53 = arith.addi 
%51, %52 : tensor<128x64xi32, #mma> 2026-02-21T09:54:48.2769975Z %54 = tt.splat %arg2 : !tt.ptr -> tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:48.2770204Z %55 = tt.addptr %54, %53 : tensor<128x64x!tt.ptr, #mma>, tensor<128x64xi32, #mma> 2026-02-21T09:54:48.2770390Z tt.store %55, %47 : tensor<128x64x!tt.ptr, #mma> 2026-02-21T09:54:48.2770519Z tt.return 2026-02-21T09:54:48.2770601Z } 2026-02-21T09:54:48.2770677Z } 2026-02-21T09:54:48.2770722Z 2026-02-21T09:54:48.2770753Z {-# 2026-02-21T09:54:48.2770834Z external_resources: { 2026-02-21T09:54:48.2770938Z mlir_reproducer: { 2026-02-21T09:54:48.2771945Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:54:48.2772941Z disable_threading: false, 2026-02-21T09:54:48.2773051Z verify_each: true 2026-02-21T09:54:48.2773139Z } 2026-02-21T09:54:48.2773211Z } 2026-02-21T09:54:48.2773298Z #-} 2026-02-21T09:54:48.2773574Z /tmp/torchinductor_root/vx/cvxgycmeq7yjydt6ne5477bm5nqyqrzezs7lyi323qohz53sf3fy.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:54:48.2774242Z /tmp/torchinductor_root/vx/cvxgycmeq7yjydt6ne5477bm5nqyqrzezs7lyi323qohz53sf3fy.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:54:48.2774786Z [619s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:54:48.2775504Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:54:48.2776158Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:54:48.2776322Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:54:50.9819875Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 65/65 7.5 configs/s 2026-02-21T09:54:54.1801997Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 213/213 35.6 configs/s 2026-02-21T09:54:57.4422186Z [628s] Generation 9 complete: 2026-02-21T09:54:57.4422402Z error=4 2026-02-21T09:54:57.4422488Z ok=62 2026-02-21T09:54:57.4422561Z min=0.9309 2026-02-21T09:54:57.4422642Z mid=1.8349 2026-02-21T09:54:57.4422747Z max=77.3627 2026-02-21T09:54:57.4422836Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:54:57.4422979Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:54:57.4423114Z 'l2_groupings': [2], 2026-02-21T09:54:57.4423214Z 'load_eviction_policies': ['', ''], 2026-02-21T09:54:57.4423336Z 'loop_orders': [[0, 1]], 2026-02-21T09:54:57.4423437Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:54:57.4423538Z 'num_stages': 1, 2026-02-21T09:54:57.4423940Z 'num_warps': 4, 2026-02-21T09:54:57.4424029Z 'pid_type': 'flat', 2026-02-21T09:54:57.4424127Z 'range_flattens': [None, None], 2026-02-21T09:54:57.4424240Z 'range_multi_buffers': [None, False], 2026-02-21T09:54:57.4424355Z 'range_num_stages': [0, 1], 2026-02-21T09:54:57.4424463Z 'range_unroll_factors': [0, 0], 2026-02-21T09:54:57.4424571Z 'range_warp_specializes': [], 
2026-02-21T09:54:57.4424670Z 'waves_per_eu': 2} 2026-02-21T09:54:57.4525165Z [628s] Fitting surrogate: 910 points, 910 targets 2026-02-21T09:54:58.1345196Z [628s] Generation 10 starting: 60 neighbors, 3 active search path(s) 2026-02-21T09:55:10.5449269Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 20.0 configs/s 2026-02-21T09:55:13.4908896Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:55:13.4934509Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:55:13.4935332Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:55:13.4936171Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:55:13.4936721Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:55:13.4937227Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:55:13.4937685Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:55:13.4938032Z #smem = #ttg.shared_memory 2026-02-21T09:55:13.4939135Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:55:13.4940172Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:55:13.4941384Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.4941751Z %cst_0 = arith.constant dense<1> : 
tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.4942065Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.4942362Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.4942614Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:13.4942869Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.4943137Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:55:13.4943440Z %cst_4 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.4943855Z %cst_5 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4944269Z %cst_6 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4944624Z %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4945028Z %cst_8 = arith.constant dense<0> : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4945302Z %cst_9 = arith.constant dense<512> : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4945596Z %cst_10 = arith.constant dense<0> : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.4945889Z %cst_11 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.4946179Z %cst_12 = arith.constant dense<0> : tensor<2x128xi8, #blocked2> 2026-02-21T09:55:13.4946419Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:55:13.4946615Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:55:13.4946800Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:55:13.4947103Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:55:13.4947343Z %cst_13 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4947586Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:55:13.4947769Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:55:13.4947950Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:55:13.4948245Z %cst_14 = arith.constant dense<4> : tensor<2x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4948553Z %0 = tt.get_program_id x : i32 2026-02-21T09:55:13.4948815Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:55:13.4949001Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:55:13.4949338Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.4949802Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.4950236Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.4950689Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.4951142Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.4951545Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.4951877Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.4952202Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4952623Z %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4953077Z %12 = arith.extsi %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.4953515Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:55:13.4954028Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = 
#blocked}>> 2026-02-21T09:55:13.4954526Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.4954836Z %16 = arith.cmpi eq, %15, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.4955085Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:55:13.4955332Z %18 = arith.cmpi eq, %15, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.4955558Z %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:55:13.4955823Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.4956022Z %21 = arith.subi %2, %0 : i32 2026-02-21T09:55:13.4956162Z %22 = arith.remsi %21, %c3_i32 : i32 2026-02-21T09:55:13.4956306Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:55:13.4956463Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:55:13.4956614Z scf.for %arg3 = %0 to %24 step %c3_i32 : i32 { 2026-02-21T09:55:13.4956782Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.4956934Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:55:13.4957083Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:55:13.4957229Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:55:13.4957374Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.4957538Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:55:13.4957682Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:55:13.4957813Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:55:13.4957975Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:55:13.4958187Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.4958455Z %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.4958721Z %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.4959007Z %37 = arith.addi %35, %4 : 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.4959231Z %38 = arith.muli %32, %c128_i32 : i32 2026-02-21T09:55:13.4959425Z %39 = tt.splat %38 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.4959680Z %40 = arith.addi %39, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.4960011Z %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.4960334Z %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.4960572Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4960784Z %44 = arith.extsi %38 : i32 to i64 2026-02-21T09:55:13.4960992Z %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.4961267Z %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.4961605Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.4961995Z %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4962228Z %49 = arith.cmpi sge, %47, %cst_10 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.4962405Z %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.4962663Z %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.4962850Z %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.4963066Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.4963339Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.4963610Z %55 = tt.broadcast %54 : 
tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4963798Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4963996Z %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4964197Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.4964484Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.4964847Z ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.4965119Z %60 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.4965422Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.4965690Z %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4965884Z %63 = arith.addi %43, %62 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4966084Z %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4966285Z %65 = tt.load %64 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.4966591Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.4966946Z ttg.local_store %65, %66 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.4967494Z %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 
2026-02-21T09:55:13.4967928Z %312 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.4968054Z %313 = arith.muli %312, %c2_i32 : i32 2026-02-21T09:55:13.4968233Z %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.4968456Z %315 = arith.addi %314, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.4968740Z %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.4969022Z %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4969217Z %318 = arith.addi %43, %317 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4969423Z %319 = tt.addptr %8, %318 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.4969633Z %320 = tt.load %319 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.4969939Z %321 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4970407Z %322 = arith.extf %321 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4970694Z %323 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:55:13.4970868Z %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4971095Z %325 = arith.addi %324, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4971373Z %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4971626Z %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4971820Z %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 
2026-02-21T09:55:13.4972016Z %329 = arith.addi %328, %48 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4972217Z %330 = tt.addptr %9, %329 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4972425Z %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4972598Z %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4972760Z %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.4972950Z %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.4977520Z %335 = arith.andi %334, %52 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.4977694Z %336 = tt.load %330, %335, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.4977957Z %337 = ttg.convert_layout %336 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4978243Z %338 = arith.shli %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4978485Z %339 = arith.shrsi %338, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4978723Z %340 = arith.shrsi %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4979056Z %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.4979396Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.4979685Z %343 = tt.broadcast %341 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4979952Z %344 = arith.select %17, %343, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4980195Z %345 = tt.broadcast %342 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 
2026-02-21T09:55:13.4980434Z %346 = arith.select %19, %345, %344 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4980667Z %347 = tt.reshape %346 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.4980895Z %348 = arith.sitofp %347 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.4981153Z %349 = ttg.local_alloc %348 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.4981483Z %350 = ttg.local_load %349 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4981965Z %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.4982318Z %352 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.4982463Z %353 = arith.cmpi slt, %352, %c2_i32 : i32 2026-02-21T09:55:13.4982600Z %354 = arith.select %353, %352, %c0_i32 : i32 2026-02-21T09:55:13.4982872Z %355 = ttg.memdesc_index %53[%354] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.4983233Z ttg.local_store %320, %355 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.4983639Z scf.yield %351, %354, %arg8, %355 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.4983984Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.4984302Z %68 = ttg.local_load %67#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:55:13.4984738Z %69 = arith.extf %68 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4985070Z %70 = arith.addi %11, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4985347Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4985612Z %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4985804Z %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4985996Z %74 = arith.addi %73, %48 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4986185Z %75 = tt.addptr %9, %74 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4986390Z %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4986557Z %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4986717Z %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.4986916Z %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.4987099Z %80 = arith.andi %79, %52 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.4987264Z %81 = tt.load %75, %80, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.4987516Z %82 = ttg.convert_layout %81 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4987816Z %83 = arith.shli %82, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4988047Z %84 = arith.shrsi %83, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4988280Z %85 = arith.shrsi %82, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4988564Z %86 = tt.expand_dims 
%84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.4988898Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.4989180Z %88 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4989421Z %89 = arith.select %17, %88, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4989654Z %90 = tt.broadcast %87 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4989883Z %91 = arith.select %19, %90, %89 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4990107Z %92 = tt.reshape %91 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.4990343Z %93 = arith.sitofp %92 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.4990591Z %94 = ttg.local_alloc %93 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.4990912Z %95 = ttg.local_load %94 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4991378Z %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.4991872Z %97 = ttg.local_load %67#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4992297Z %98 = arith.extf %97 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4992629Z %99 = arith.addi %11, %cst_6 : 
tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.4992907Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4993157Z %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4993351Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4993570Z %103 = arith.addi %102, %48 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4993770Z %104 = tt.addptr %9, %103 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.4993979Z %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4994152Z %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.4994319Z %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.4994504Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.4994715Z %109 = arith.andi %108, %52 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.4994882Z %110 = tt.load %104, %109, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.4995144Z %111 = ttg.convert_layout %110 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4995432Z %112 = arith.shli %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4995686Z %113 = arith.shrsi %112, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4995929Z %114 = arith.shrsi %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.4996225Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.4996570Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 
1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.4996862Z %117 = tt.broadcast %115 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4997102Z %118 = arith.select %17, %117, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4997346Z %119 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4997577Z %120 = arith.select %19, %119, %118 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.4997809Z %121 = tt.reshape %120 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.4998033Z %122 = arith.sitofp %121 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.4998305Z %123 = ttg.local_alloc %122 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.4998632Z %124 = ttg.local_load %123 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.4999103Z %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.4999488Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.4999706Z %126 = arith.truncf %125 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.4999977Z %127 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5000220Z %128 = arith.muli %127, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5000451Z %129 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.5000715Z %130 = tt.broadcast 
%128 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5000924Z %131 = tt.broadcast %129 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5001105Z %132 = arith.addi %130, %131 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5001314Z %133 = tt.addptr %20, %132 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5001513Z tt.store %133, %126 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.5001661Z %134 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:55:13.5001787Z %135 = arith.divsi %134, %c256_i32 : i32 2026-02-21T09:55:13.5001907Z %136 = arith.muli %135, %c4_i32 : i32 2026-02-21T09:55:13.5002029Z %137 = arith.subi %c128_i32, %136 : i32 2026-02-21T09:55:13.5002147Z %138 = arith.minsi %137, %c4_i32 : i32 2026-02-21T09:55:13.5002269Z %139 = arith.remsi %134, %c256_i32 : i32 2026-02-21T09:55:13.5002384Z %140 = arith.remsi %139, %138 : i32 2026-02-21T09:55:13.5002518Z %141 = arith.addi %136, %140 : i32 2026-02-21T09:55:13.5002672Z %142 = arith.divsi %139, %138 : i32 2026-02-21T09:55:13.5002792Z %143 = arith.muli %141, %c128_i32 : i32 2026-02-21T09:55:13.5002967Z %144 = tt.splat %143 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.5003186Z %145 = tt.splat %143 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.5003425Z %146 = arith.addi %144, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.5003641Z %147 = arith.addi %145, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.5003810Z %148 = arith.muli %142, %c128_i32 : i32 2026-02-21T09:55:13.5003975Z %149 = tt.splat %148 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.5004180Z %150 = arith.addi %149, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.5004461Z %151 = tt.expand_dims %146 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent 
= #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.5004715Z %152 = arith.muli %151, %cst_2 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.5004914Z %153 = tt.broadcast %152 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5005091Z %154 = arith.extsi %148 : i32 to i64 2026-02-21T09:55:13.5005261Z %155 = tt.splat %154 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.5005488Z %156 = arith.addi %155, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.5005769Z %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5006075Z %158 = tt.broadcast %157 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5006280Z %159 = arith.cmpi sge, %157, %cst_10 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5006457Z %160 = arith.cmpi slt, %157, %cst_11 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5006625Z %161 = arith.andi %159, %160 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.5006814Z %162 = tt.broadcast %161 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5007068Z %163 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.5007253Z %164 = arith.addi %153, %55 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5007454Z %165 = tt.addptr %8, %164 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5007663Z %166 = tt.load %165 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5007950Z %167 = ttg.memdesc_index %163[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5008316Z ttg.local_store %166, %167 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 
2026-02-21T09:55:13.5008557Z %168 = arith.addi %153, %62 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5008757Z %169 = tt.addptr %8, %168 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5008993Z %170 = tt.load %169 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5009275Z %171 = ttg.memdesc_index %163[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5009634Z ttg.local_store %170, %171 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5010164Z %172:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %167, %arg8 = %171) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:55:13.5010611Z %312 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.5010743Z %313 = arith.muli %312, %c2_i32 : i32 2026-02-21T09:55:13.5010916Z %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.5011141Z %315 = arith.addi %314, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.5011444Z %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.5011721Z %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5011919Z %318 = arith.addi %153, %317 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5012120Z %319 = tt.addptr %8, %318 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5012332Z %320 = tt.load %319 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5012635Z %321 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 
2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5013074Z %322 = arith.extf %321 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5013358Z %323 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:55:13.5013529Z %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5013756Z %325 = arith.addi %324, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5014052Z %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5014299Z %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5014495Z %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5014689Z %329 = arith.addi %328, %158 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5014891Z %330 = tt.addptr %9, %329 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5015102Z %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5015275Z %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5015441Z %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.5015629Z %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5015825Z %335 = arith.andi %334, %162 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5015995Z %336 = tt.load %330, %335, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5016259Z %337 = ttg.convert_layout %336 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5016546Z %338 = arith.shli %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:55:13.5016800Z %339 = arith.shrsi %338, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5017043Z %340 = arith.shrsi %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5017333Z %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5017675Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5017963Z %343 = tt.broadcast %341 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5018227Z %344 = arith.select %17, %343, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5018469Z %345 = tt.broadcast %342 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5018706Z %346 = arith.select %19, %345, %344 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5018949Z %347 = tt.reshape %346 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5019178Z %348 = arith.sitofp %347 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5019430Z %349 = ttg.local_alloc %348 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5019759Z %350 = ttg.local_load %349 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5020240Z %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5020592Z %352 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.5020721Z %353 = arith.cmpi slt, 
%352, %c2_i32 : i32 2026-02-21T09:55:13.5020854Z %354 = arith.select %353, %352, %c0_i32 : i32 2026-02-21T09:55:13.5021128Z %355 = ttg.memdesc_index %163[%354] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5021497Z ttg.local_store %320, %355 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5021921Z scf.yield %351, %354, %arg8, %355 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5022267Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.5022586Z %173 = ttg.local_load %172#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5023027Z %174 = arith.extf %173 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5023331Z %175 = arith.addi %73, %158 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5023536Z %176 = tt.addptr %9, %175 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5023744Z %177 = arith.andi %79, %162 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5023922Z %178 = tt.load %176, %177, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5024182Z %179 = ttg.convert_layout %178 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5024471Z %180 = arith.shli %179, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5024709Z %181 = arith.shrsi %180, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5024974Z %182 = arith.shrsi %179, %cst_14 : 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5025274Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5025615Z %184 = tt.expand_dims %182 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5025906Z %185 = tt.broadcast %183 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5026170Z %186 = arith.select %17, %185, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5026415Z %187 = tt.broadcast %184 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5026655Z %188 = arith.select %19, %187, %186 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5026888Z %189 = tt.reshape %188 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5027133Z %190 = arith.sitofp %189 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5027389Z %191 = ttg.local_alloc %190 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5027722Z %192 = ttg.local_load %191 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5028203Z %193 = tt.dot %174, %192, %172#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5028711Z %194 = ttg.local_load %172#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5029155Z %195 = arith.extf %194 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to 
tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5029467Z %196 = arith.addi %102, %158 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5029671Z %197 = tt.addptr %9, %196 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5029880Z %198 = arith.andi %108, %162 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5030067Z %199 = tt.load %197, %198, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5030333Z %200 = ttg.convert_layout %199 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5030624Z %201 = arith.shli %200, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5030863Z %202 = arith.shrsi %201, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5031109Z %203 = arith.shrsi %200, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5031407Z %204 = tt.expand_dims %202 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5031755Z %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5032050Z %206 = tt.broadcast %204 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5032294Z %207 = arith.select %17, %206, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5032543Z %208 = tt.broadcast %205 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5032777Z %209 = arith.select %19, %208, %207 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5033031Z %210 = tt.reshape %209 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5033263Z %211 = arith.sitofp %210 : tensor<4x128xi8, 
#blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5033518Z %212 = ttg.local_alloc %211 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5033849Z %213 = ttg.local_load %212 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5034324Z %214 = tt.dot %195, %213, %193, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5034737Z ttg.local_dealloc %163 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.5034958Z %215 = arith.truncf %214 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.5035232Z %216 = tt.expand_dims %147 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5035491Z %217 = arith.muli %216, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5035728Z %218 = tt.expand_dims %150 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.5035991Z %219 = tt.broadcast %217 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5036204Z %220 = tt.broadcast %218 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5036390Z %221 = arith.addi %219, %220 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5036590Z %222 = tt.addptr %20, %221 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5036793Z tt.store %222, %215 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.5036945Z %223 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:55:13.5037075Z %224 = arith.divsi %223, %c256_i32 : i32 2026-02-21T09:55:13.5037201Z %225 = arith.muli %224, %c4_i32 : i32 2026-02-21T09:55:13.5037328Z %226 = arith.subi %c128_i32, %225 : i32 
2026-02-21T09:55:13.5037450Z %227 = arith.minsi %226, %c4_i32 : i32 2026-02-21T09:55:13.5037574Z %228 = arith.remsi %223, %c256_i32 : i32 2026-02-21T09:55:13.5037694Z %229 = arith.remsi %228, %227 : i32 2026-02-21T09:55:13.5037816Z %230 = arith.addi %225, %229 : i32 2026-02-21T09:55:13.5037957Z %231 = arith.divsi %228, %227 : i32 2026-02-21T09:55:13.5038076Z %232 = arith.muli %230, %c128_i32 : i32 2026-02-21T09:55:13.5038257Z %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.5038478Z %234 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.5038702Z %235 = arith.addi %233, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.5038921Z %236 = arith.addi %234, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.5039094Z %237 = arith.muli %231, %c128_i32 : i32 2026-02-21T09:55:13.5039266Z %238 = tt.splat %237 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.5039475Z %239 = arith.addi %238, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.5039755Z %240 = tt.expand_dims %235 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.5040013Z %241 = arith.muli %240, %cst_2 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.5040218Z %242 = tt.broadcast %241 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5040402Z %243 = arith.extsi %237 : i32 to i64 2026-02-21T09:55:13.5040573Z %244 = tt.splat %243 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.5040818Z %245 = arith.addi %244, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.5041100Z %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 
2026-02-21T09:55:13.5041388Z %247 = tt.broadcast %246 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5041599Z %248 = arith.cmpi sge, %246, %cst_10 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5041777Z %249 = arith.cmpi slt, %246, %cst_11 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5041968Z %250 = arith.andi %248, %249 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.5042161Z %251 = tt.broadcast %250 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5042387Z %252 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.5042616Z %253 = arith.addi %242, %55 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5042822Z %254 = tt.addptr %8, %253 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5043058Z %255 = tt.load %254 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5043347Z %256 = ttg.memdesc_index %252[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5043715Z ttg.local_store %255, %256 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5043958Z %257 = arith.addi %242, %62 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5044167Z %258 = tt.addptr %8, %257 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5044740Z %259 = tt.load %258 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5045040Z %260 = ttg.memdesc_index %252[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5045406Z ttg.local_store %259, %260 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5046046Z %261:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = 
%c1_i32, %arg7 = %256, %arg8 = %260) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:55:13.5046478Z %312 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.5046616Z %313 = arith.muli %312, %c2_i32 : i32 2026-02-21T09:55:13.5046788Z %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.5047023Z %315 = arith.addi %314, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.5047309Z %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.5047588Z %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5047791Z %318 = arith.addi %242, %317 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5047997Z %319 = tt.addptr %8, %318 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5048216Z %320 = tt.load %319 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5048525Z %321 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5048968Z %322 = arith.extf %321 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5049300Z %323 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:55:13.5049475Z %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5049707Z %325 = arith.addi %324, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5049991Z %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 
2026-02-21T09:55:13.5050244Z %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5050447Z %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5050680Z %329 = arith.addi %328, %247 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5050890Z %330 = tt.addptr %9, %329 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5051107Z %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5051283Z %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5051457Z %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.5051673Z %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5051873Z %335 = arith.andi %334, %251 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5052047Z %336 = tt.load %330, %335, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5052315Z %337 = ttg.convert_layout %336 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5052612Z %338 = arith.shli %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5052852Z %339 = arith.shrsi %338, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5053100Z %340 = arith.shrsi %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5053396Z %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5053744Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5054035Z %343 = tt.broadcast %341 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5054305Z %344 = 
arith.select %17, %343, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5054558Z %345 = tt.broadcast %342 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5054798Z %346 = arith.select %19, %345, %344 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5055033Z %347 = tt.reshape %346 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5055266Z %348 = arith.sitofp %347 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5055525Z %349 = ttg.local_alloc %348 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5055861Z %350 = ttg.local_load %349 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5056346Z %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5056705Z %352 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.5056842Z %353 = arith.cmpi slt, %352, %c2_i32 : i32 2026-02-21T09:55:13.5056979Z %354 = arith.select %353, %352, %c0_i32 : i32 2026-02-21T09:55:13.5057276Z %355 = ttg.memdesc_index %252[%354] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5057649Z ttg.local_store %320, %355 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5058052Z scf.yield %351, %354, %arg8, %355 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5058404Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 
2026-02-21T09:55:13.5058723Z %262 = ttg.local_load %261#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5059183Z %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5059491Z %264 = arith.addi %73, %247 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5059708Z %265 = tt.addptr %9, %264 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5059915Z %266 = arith.andi %79, %251 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5060090Z %267 = tt.load %265, %266, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5060351Z %268 = ttg.convert_layout %267 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5060641Z %269 = arith.shli %268, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5060885Z %270 = arith.shrsi %269, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5061133Z %271 = arith.shrsi %268, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5061431Z %272 = tt.expand_dims %270 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5061769Z %273 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5062055Z %274 = tt.broadcast %272 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5062311Z %275 = arith.select %17, %274, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5062552Z %276 = tt.broadcast %273 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, 
#blocked> 2026-02-21T09:55:13.5062790Z %277 = arith.select %19, %276, %275 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5063019Z %278 = tt.reshape %277 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5063244Z %279 = arith.sitofp %278 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5063498Z %280 = ttg.local_alloc %279 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5063828Z %281 = ttg.local_load %280 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5064302Z %282 = tt.dot %263, %281, %261#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5064796Z %283 = ttg.local_load %261#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5065231Z %284 = arith.extf %283 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5065548Z %285 = arith.addi %102, %247 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5065747Z %286 = tt.addptr %9, %285 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5065954Z %287 = arith.andi %108, %251 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5066125Z %288 = tt.load %286, %287, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5066387Z %289 = ttg.convert_layout %288 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5066674Z %290 = arith.shli %289, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:55:13.5066930Z %291 = arith.shrsi %290, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5067168Z %292 = arith.shrsi %289, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5067459Z %293 = tt.expand_dims %291 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5067813Z %294 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5068100Z %295 = tt.broadcast %293 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5068340Z %296 = arith.select %17, %295, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5068582Z %297 = tt.broadcast %294 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5068815Z %298 = arith.select %19, %297, %296 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5069046Z %299 = tt.reshape %298 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5069271Z %300 = arith.sitofp %299 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5069526Z %301 = ttg.local_alloc %300 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5069861Z %302 = ttg.local_load %301 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5070351Z %303 = tt.dot %284, %302, %282, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5070738Z ttg.local_dealloc %252 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.5070956Z 
%304 = arith.truncf %303 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.5071229Z %305 = tt.expand_dims %236 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5071469Z %306 = arith.muli %305, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5071703Z %307 = tt.expand_dims %239 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.5071964Z %308 = tt.broadcast %306 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5072170Z %309 = tt.broadcast %307 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5072353Z %310 = arith.addi %308, %309 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5072548Z %311 = tt.addptr %20, %310 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5072752Z tt.store %311, %304 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.5072896Z } {tt.num_stages = 1 : i32} 2026-02-21T09:55:13.5073019Z scf.for %arg3 = %24 to %2 step %c1_i32 : i32 { 2026-02-21T09:55:13.5073155Z %25 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.5073295Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:55:13.5073414Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:55:13.5073532Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:55:13.5073651Z %29 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.5073771Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:55:13.5073885Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:55:13.5073994Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:55:13.5074108Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:55:13.5074275Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.5074517Z %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.5074732Z %36 = arith.addi %34, %3 : 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.5074946Z %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.5075111Z %38 = arith.muli %32, %c128_i32 : i32 2026-02-21T09:55:13.5075267Z %39 = tt.splat %38 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.5075486Z %40 = arith.addi %39, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.5075756Z %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.5076008Z %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.5076201Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5076376Z %44 = arith.extsi %38 : i32 to i64 2026-02-21T09:55:13.5076543Z %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.5076762Z %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.5077042Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5077319Z %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5077521Z %49 = arith.cmpi sge, %47, %cst_10 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5077696Z %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.5077860Z %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.5078061Z %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5078275Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.5078541Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, 
#ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.5078812Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5078999Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5079198Z %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5079397Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5079682Z %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5080045Z ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5080315Z %60 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.5080588Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.5080872Z %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5081062Z %63 = arith.addi %43, %62 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5081261Z %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5081461Z %65 = tt.load %64 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5081744Z %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5082097Z ttg.local_store %65, %66 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5082692Z %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x128xf32, #mma>, i32, 
!ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:55:13.5083119Z %134 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.5083242Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:55:13.5083443Z %136 = tt.splat %135 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.5083669Z %137 = arith.addi %136, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.5083947Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1> 2026-02-21T09:55:13.5084224Z %139 = tt.broadcast %138 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5084420Z %140 = arith.addi %43, %139 : tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5084623Z %141 = tt.addptr %8, %140 : tensor<128x4x!tt.ptr, #blocked1>, tensor<128x4xi32, #blocked1> 2026-02-21T09:55:13.5084830Z %142 = tt.load %141 : tensor<128x4x!tt.ptr, #blocked1> 2026-02-21T09:55:13.5085135Z %143 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5085573Z %144 = arith.extf %143 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5085857Z %145 = arith.extsi %arg4 : i32 to i64 2026-02-21T09:55:13.5086054Z %146 = tt.splat %145 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5086279Z %147 = arith.addi %146, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5086551Z %148 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5086800Z %149 = arith.muli %148, %cst_7 : tensor<2x1xi64, 
#blocked2> 2026-02-21T09:55:13.5086992Z %150 = tt.broadcast %149 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5087189Z %151 = arith.addi %150, %48 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5087387Z %152 = tt.addptr %9, %151 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5087598Z %153 = arith.cmpi sge, %148, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5087775Z %154 = arith.cmpi slt, %148, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5087938Z %155 = arith.andi %153, %154 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.5088130Z %156 = tt.broadcast %155 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5088320Z %157 = arith.andi %156, %52 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5088492Z %158 = tt.load %152, %157, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5088773Z %159 = ttg.convert_layout %158 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5089058Z %160 = arith.shli %159, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5089299Z %161 = arith.shrsi %160, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5089538Z %162 = arith.shrsi %159, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5089836Z %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5090197Z %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5090481Z %165 = tt.broadcast %163 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5090728Z %166 = arith.select %17, %165, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, 
#blocked> 2026-02-21T09:55:13.5090983Z %167 = tt.broadcast %164 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5091220Z %168 = arith.select %19, %167, %166 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5091453Z %169 = tt.reshape %168 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5091678Z %170 = arith.sitofp %169 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5091934Z %171 = ttg.local_alloc %170 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5092261Z %172 = ttg.local_load %171 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5092740Z %173 = tt.dot %144, %172, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5093096Z %174 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.5093223Z %175 = arith.cmpi slt, %174, %c2_i32 : i32 2026-02-21T09:55:13.5093357Z %176 = arith.select %175, %174, %c0_i32 : i32 2026-02-21T09:55:13.5093644Z %177 = ttg.memdesc_index %53[%176] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5094004Z ttg.local_store %142, %177 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5094407Z scf.yield %173, %176, %arg8, %177 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.5094745Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.5095062Z %68 = ttg.local_load %67#2 : !ttg.memdesc<128x4xbf16, 
#shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5095491Z %69 = arith.extf %68 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5095818Z %70 = arith.addi %11, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5096094Z %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5096338Z %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5096525Z %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5096735Z %74 = arith.addi %73, %48 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5096926Z %75 = tt.addptr %9, %74 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5097132Z %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5097298Z %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5097458Z %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.5097642Z %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5097826Z %80 = arith.andi %79, %52 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5098008Z %81 = tt.load %75, %80, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5098258Z %82 = ttg.convert_layout %81 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5098536Z %83 = arith.shli %82, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5098766Z %84 = arith.shrsi %83, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5099009Z %85 = arith.shrsi %82, %cst_14 : 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5099293Z %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5099622Z %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5099905Z %88 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5100145Z %89 = arith.select %17, %88, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5100378Z %90 = tt.broadcast %87 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5100609Z %91 = arith.select %19, %90, %89 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5100831Z %92 = tt.reshape %91 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5101050Z %93 = arith.sitofp %92 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5101298Z %94 = ttg.local_alloc %93 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5101635Z %95 = ttg.local_load %94 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5102102Z %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5102591Z %97 = ttg.local_load %67#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5103017Z %98 = arith.extf %97 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5103343Z %99 = arith.addi %11, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.5103620Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5103870Z %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5104067Z %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5104258Z %103 = arith.addi %102, %48 : tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5104456Z %104 = tt.addptr %9, %103 : tensor<2x128x!tt.ptr, #blocked2>, tensor<2x128xi64, #blocked2> 2026-02-21T09:55:13.5104687Z %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5104859Z %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2> 2026-02-21T09:55:13.5105028Z %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2> 2026-02-21T09:55:13.5105211Z %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5105403Z %109 = arith.andi %108, %52 : tensor<2x128xi1, #blocked2> 2026-02-21T09:55:13.5105573Z %110 = tt.load %104, %109, %cst_12 : tensor<2x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.5105836Z %111 = ttg.convert_layout %110 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5106137Z %112 = arith.shli %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5106374Z %113 = arith.shrsi %112, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5106613Z %114 = arith.shrsi %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.5106916Z %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, 
#blocked> 2026-02-21T09:55:13.5107255Z %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:55:13.5107540Z %117 = tt.broadcast %115 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5107779Z %118 = arith.select %17, %117, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5108022Z %119 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5108256Z %120 = arith.select %19, %119, %118 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:55:13.5108489Z %121 = tt.reshape %120 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.5108715Z %122 = arith.sitofp %121 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2> 2026-02-21T09:55:13.5108969Z %123 = ttg.local_alloc %122 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:55:13.5109295Z %124 = ttg.local_load %123 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.5109779Z %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.5110162Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.5110379Z %126 = arith.truncf %125 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.5110655Z %127 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5110896Z %128 = arith.muli %127, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.5111124Z %129 = tt.expand_dims %40 {axis = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.5111386Z %130 = tt.broadcast %128 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5111596Z %131 = tt.broadcast %129 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5111777Z %132 = arith.addi %130, %131 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5111972Z %133 = tt.addptr %20, %132 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.5112170Z tt.store %133, %126 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.5112327Z } {tt.num_stages = 1 : i32} 2026-02-21T09:55:13.5112429Z tt.return 2026-02-21T09:55:13.5112512Z } 2026-02-21T09:55:13.5112588Z } 2026-02-21T09:55:13.5112632Z 2026-02-21T09:55:13.5112663Z {-# 2026-02-21T09:55:13.5112746Z external_resources: { 2026-02-21T09:55:13.5112847Z mlir_reproducer: { 2026-02-21T09:55:13.5113842Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:55:13.5114855Z disable_threading: false, 2026-02-21T09:55:13.5114961Z verify_each: true 2026-02-21T09:55:13.5123724Z } 2026-02-21T09:55:13.5123818Z } 2026-02-21T09:55:13.5123891Z #-} 2026-02-21T09:55:13.5124241Z /tmp/torchinductor_root/vt/cvtqkqvhi7lrkmrn7rd2e2qmchoq3bnvg4cimoocko5nbr34ijna.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:55:13.5124955Z 
/tmp/torchinductor_root/vt/cvtqkqvhi7lrkmrn7rd2e2qmchoq3bnvg4cimoocko5nbr34ijna.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:55:13.5125519Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:55:13.5126295Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:55:13.5127010Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:55:13.5127181Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:55:13.6235436Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:55:13.6244477Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:55:13.6244832Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:55:13.6245141Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:55:13.6245444Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:55:13.6245729Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:55:13.6245980Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:55:13.6246217Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:55:13.6246395Z #smem = #ttg.shared_memory 2026-02-21T09:55:13.6246630Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:55:13.6247103Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:55:13.6247577Z %cst = arith.constant dense<4> : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6247731Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:55:13.6247850Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:55:13.6247993Z %cst_0 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6248172Z %cst_1 = arith.constant dense<0> : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6248313Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:55:13.6248432Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:55:13.6248546Z %c4_i32 = arith.constant 4 : i32 
2026-02-21T09:55:13.6248660Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:55:13.6248850Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:55:13.6248964Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:13.6249108Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6249286Z %cst_3 = arith.constant dense<8192> : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6249464Z %cst_4 = arith.constant dense<0> : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6249633Z %cst_5 = arith.constant dense<512> : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6249861Z %cst_6 = arith.constant dense<0> : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6250030Z %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6250239Z %cst_8 = arith.constant dense<2> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6250487Z %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6250672Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:55:13.6250829Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6251014Z %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:55:13.6251189Z %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:55:13.6251361Z %cst_13 = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6251507Z %0 = tt.get_program_id x : i32 2026-02-21T09:55:13.6251622Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:55:13.6251738Z %2 = arith.minsi %1, %c4096_i32 : i32 2026-02-21T09:55:13.6251941Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6252234Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6252503Z %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : 
tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6252770Z %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6253030Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6253272Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6253476Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6253703Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6254016Z %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6254372Z %12 = arith.extsi %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6254725Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> 2026-02-21T09:55:13.6255139Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T09:55:13.6255559Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T09:55:13.6255815Z %16 = arith.cmpi eq, %15, %cst_11 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:55:13.6256019Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 2026-02-21T09:55:13.6256221Z %18 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked1> 2026-02-21T09:55:13.6256418Z %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1> 
2026-02-21T09:55:13.6256646Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:55:13.6256806Z %21 = arith.subi %2, %0 : i32 2026-02-21T09:55:13.6256920Z %22 = arith.remsi %21, %c3_i32 : i32 2026-02-21T09:55:13.6257034Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:55:13.6257149Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:55:13.6257271Z scf.for %arg3 = %0 to %24 step %c3_i32 : i32 { 2026-02-21T09:55:13.6257411Z %25 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:55:13.6257548Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:55:13.6257667Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:55:13.6257784Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:55:13.6257905Z %29 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:55:13.6258023Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:55:13.6258134Z %31 = arith.addi %26, %30 : i32 2026-02-21T09:55:13.6258246Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:55:13.6258362Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:55:13.6258530Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6258743Z %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6258959Z %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6259171Z %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6259333Z %38 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:55:13.6259491Z %39 = tt.splat %38 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6259692Z %40 = arith.addi %39, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6259979Z %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6260230Z %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2> 
2026-02-21T09:55:13.6260423Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6260597Z %44 = arith.extsi %38 : i32 to i64 2026-02-21T09:55:13.6260760Z %45 = tt.splat %44 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6260975Z %46 = arith.addi %45, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6261245Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6261521Z %48 = tt.broadcast %47 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6261718Z %49 = arith.cmpi sge, %47, %cst_4 : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6261884Z %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6262047Z %51 = arith.andi %49, %50 : tensor<1x256xi1, #blocked> 2026-02-21T09:55:13.6262224Z %52 = tt.broadcast %51 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6262437Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6262720Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6262988Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6263178Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6263373Z %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6263579Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6263820Z %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6264076Z %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked> 
2026-02-21T09:55:13.6264261Z %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6264444Z %62 = arith.addi %61, %48 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6264636Z %63 = tt.addptr %9, %62 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6264849Z %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6265015Z %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6265174Z %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked> 2026-02-21T09:55:13.6265347Z %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6265530Z %68 = arith.andi %67, %52 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6265746Z %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6266089Z %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6266451Z ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6266721Z %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6266999Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6267284Z %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6267495Z %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6267687Z %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6267892Z %76 = tt.load %75 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6268079Z %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:55:13.6268351Z %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6268589Z %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6268771Z %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6268955Z %81 = arith.addi %80, %48 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6269142Z %82 = tt.addptr %9, %81 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6269342Z %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6269503Z %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6269661Z %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked> 2026-02-21T09:55:13.6269835Z %86 = tt.broadcast %85 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6270016Z %87 = arith.andi %86, %52 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6270247Z %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6270580Z %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6270937Z ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6271563Z %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:55:13.6272102Z %317 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.6272229Z %318 = 
arith.muli %317, %c2_i32 : i32 2026-02-21T09:55:13.6272404Z %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6272627Z %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6272923Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6273201Z %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6273398Z %323 = arith.addi %43, %322 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6273601Z %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6273809Z %325 = tt.load %324 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6274113Z %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6274556Z %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6274842Z %328 = arith.extsi %317 : i32 to i64 2026-02-21T09:55:13.6275014Z %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6275233Z %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6275523Z %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6275766Z %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6275959Z %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6276151Z %334 = arith.addi %333, %48 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6276346Z 
%335 = tt.addptr %9, %334 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6276555Z %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6276723Z %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6276887Z %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked> 2026-02-21T09:55:13.6277072Z %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6277261Z %340 = arith.andi %339, %52 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6277431Z %341 = tt.load %335, %340, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6277603Z %342 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6277763Z %343 = arith.shrsi %342, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6278008Z %344 = ttg.convert_layout %343 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6278275Z %345 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6278521Z %346 = ttg.convert_layout %345 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6278858Z %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6279205Z %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6279513Z %349 = tt.broadcast %347 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6279762Z %350 = arith.select %17, %349, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6280025Z %351 = tt.broadcast %348 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6280273Z %352 = arith.select %19, %351, %350 : 
tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6280534Z %353 = tt.reshape %352 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6280762Z %354 = arith.sitofp %353 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6281020Z %355 = ttg.local_alloc %354 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6281348Z %356 = ttg.local_load %355 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6281834Z %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6282191Z %358 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.6282319Z %359 = arith.cmpi slt, %358, %c2_i32 : i32 2026-02-21T09:55:13.6282456Z %360 = arith.select %359, %358, %c0_i32 : i32 2026-02-21T09:55:13.6282831Z %361 = ttg.memdesc_index %53[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6283213Z ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6283699Z scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6284123Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.6284443Z %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:55:13.6284882Z %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6285182Z %93 = arith.shli %90#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6285344Z %94 = arith.shrsi %93, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6285583Z %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6285822Z %96 = arith.shrsi %90#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6286059Z %97 = ttg.convert_layout %96 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6286407Z %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6286747Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6287033Z %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6287275Z %101 = arith.select %17, %100, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6287521Z %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6287775Z %103 = arith.select %19, %102, %101 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6288013Z %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6288241Z %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6288494Z %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6288850Z %107 = 
ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6289320Z %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6289814Z %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6290244Z %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6290542Z %111 = arith.shli %90#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6290703Z %112 = arith.shrsi %111, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6290948Z %113 = ttg.convert_layout %112 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6291198Z %114 = arith.shrsi %90#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6291457Z %115 = ttg.convert_layout %114 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6291791Z %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6292136Z %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6292429Z %118 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6292673Z %119 = arith.select %17, %118, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6292918Z %120 = tt.broadcast %117 : 
tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6293154Z %121 = arith.select %19, %120, %119 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6293390Z %122 = tt.reshape %121 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6293616Z %123 = arith.sitofp %122 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6293869Z %124 = ttg.local_alloc %123 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6294197Z %125 = ttg.local_load %124 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6294685Z %126 = tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6295068Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6295285Z %127 = arith.truncf %126 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:55:13.6295555Z %128 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6295800Z %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6296053Z %130 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:55:13.6296311Z %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6296521Z %132 = tt.broadcast %130 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6296702Z %133 = arith.addi %131, %132 : tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6296911Z %134 = tt.addptr %20, %133 : tensor<128x256x!tt.ptr, #mma>, 
tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6297113Z tt.store %134, %127 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:55:13.6297256Z %135 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:55:13.6297381Z %136 = arith.divsi %135, %c128_i32 : i32 2026-02-21T09:55:13.6297503Z %137 = arith.muli %136, %c4_i32 : i32 2026-02-21T09:55:13.6297625Z %138 = arith.subi %c128_i32, %137 : i32 2026-02-21T09:55:13.6297745Z %139 = arith.minsi %138, %c4_i32 : i32 2026-02-21T09:55:13.6297867Z %140 = arith.remsi %135, %c128_i32 : i32 2026-02-21T09:55:13.6297982Z %141 = arith.remsi %140, %139 : i32 2026-02-21T09:55:13.6298098Z %142 = arith.addi %137, %141 : i32 2026-02-21T09:55:13.6298214Z %143 = arith.divsi %140, %139 : i32 2026-02-21T09:55:13.6298331Z %144 = arith.muli %142, %c128_i32 : i32 2026-02-21T09:55:13.6298507Z %145 = tt.splat %144 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6298728Z %146 = tt.splat %144 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6298948Z %147 = arith.addi %145, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6299184Z %148 = arith.addi %146, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6299353Z %149 = arith.muli %143, %c256_i32 : i32 2026-02-21T09:55:13.6299518Z %150 = tt.splat %149 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6299751Z %151 = arith.addi %150, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6300028Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6300286Z %153 = arith.muli %152, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6300489Z %154 = tt.broadcast %153 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6300668Z %155 = arith.extsi %149 : i32 to i64 
2026-02-21T09:55:13.6300834Z %156 = tt.splat %155 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6301056Z %157 = arith.addi %156, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6301332Z %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6301612Z %159 = tt.broadcast %158 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6301817Z %160 = arith.cmpi sge, %158, %cst_4 : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6302006Z %161 = arith.cmpi slt, %158, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6302172Z %162 = arith.andi %160, %161 : tensor<1x256xi1, #blocked> 2026-02-21T09:55:13.6302356Z %163 = tt.broadcast %162 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6302573Z %164 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6302760Z %165 = arith.addi %154, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6302960Z %166 = tt.addptr %8, %165 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6303170Z %167 = tt.load %166 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6303345Z %168 = arith.addi %61, %159 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6303537Z %169 = tt.addptr %9, %168 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6303733Z %170 = arith.andi %67, %163 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6303949Z %171 = tt.load %169, %170, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6304309Z %172 = ttg.memdesc_index %164[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6304670Z ttg.local_store %167, %172 : tensor<128x4xbf16, 
#blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6304913Z %173 = arith.addi %154, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6305112Z %174 = tt.addptr %8, %173 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6305321Z %175 = tt.load %174 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6305480Z %176 = arith.addi %80, %159 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6305671Z %177 = tt.addptr %9, %176 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6305868Z %178 = arith.andi %86, %163 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6306081Z %179 = tt.load %177, %178, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6306422Z %180 = ttg.memdesc_index %164[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6306799Z ttg.local_store %175, %180 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6307428Z %181:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %172, %arg8 = %180, %arg9 = %171, %arg10 = %179) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:55:13.6307961Z %317 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.6308087Z %318 = arith.muli %317, %c2_i32 : i32 2026-02-21T09:55:13.6308259Z %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6308485Z %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6308762Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, 
#ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6309045Z %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6309245Z %323 = arith.addi %154, %322 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6309448Z %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6309658Z %325 = tt.load %324 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6309975Z %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6310415Z %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6310699Z %328 = arith.extsi %317 : i32 to i64 2026-02-21T09:55:13.6310874Z %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6311097Z %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6311382Z %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6311627Z %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6311818Z %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6312007Z %334 = arith.addi %333, %159 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6312219Z %335 = tt.addptr %9, %334 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6312423Z %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6312596Z %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6312760Z %338 = arith.andi %336, %337 : tensor<2x1xi1, 
#blocked> 2026-02-21T09:55:13.6312943Z %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6313135Z %340 = arith.andi %339, %163 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6313303Z %341 = tt.load %335, %340, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6313477Z %342 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6313637Z %343 = arith.shrsi %342, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6313886Z %344 = ttg.convert_layout %343 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6314141Z %345 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6314387Z %346 = ttg.convert_layout %345 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6314743Z %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6315091Z %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6315380Z %349 = tt.broadcast %347 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6315632Z %350 = arith.select %17, %349, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6315876Z %351 = tt.broadcast %348 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6316120Z %352 = arith.select %19, %351, %350 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6316360Z %353 = tt.reshape %352 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6316587Z %354 = arith.sitofp %353 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6316844Z %355 = ttg.local_alloc %354 : 
(tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6317172Z %356 = ttg.local_load %355 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6317653Z %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6318019Z %358 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.6318148Z %359 = arith.cmpi slt, %358, %c2_i32 : i32 2026-02-21T09:55:13.6318285Z %360 = arith.select %359, %358, %c0_i32 : i32 2026-02-21T09:55:13.6318557Z %361 = ttg.memdesc_index %164[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6318923Z ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6319426Z scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6319848Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.6320180Z %182 = ttg.local_load %181#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6320613Z %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6320914Z %184 = arith.shli %181#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6321077Z %185 = arith.shrsi %184, %cst : 
tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6321323Z %186 = ttg.convert_layout %185 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6321571Z %187 = arith.shrsi %181#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6321818Z %188 = ttg.convert_layout %187 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6322155Z %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6322501Z %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6322854Z %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6323100Z %192 = arith.select %17, %191, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6323347Z %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6323586Z %194 = arith.select %19, %193, %192 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6323825Z %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6324050Z %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6324306Z %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6324634Z %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6325107Z %199 = tt.dot %183, %198, %181#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent 
= #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6325609Z %200 = ttg.local_load %181#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6326062Z %201 = arith.extf %200 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6326362Z %202 = arith.shli %181#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6326523Z %203 = arith.shrsi %202, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6326767Z %204 = ttg.convert_layout %203 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6326835Z %205 = arith.shrsi %181#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6326979Z %206 = ttg.convert_layout %205 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6327157Z %207 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6327309Z %208 = tt.expand_dims %206 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6327408Z %209 = tt.broadcast %207 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6327540Z %210 = arith.select %17, %209, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6327638Z %211 = tt.broadcast %208 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6327742Z %212 = arith.select %19, %211, %210 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6327835Z %213 = tt.reshape %212 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6327933Z %214 = 
arith.sitofp %213 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6328052Z %215 = ttg.local_alloc %214 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6328220Z %216 = ttg.local_load %215 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6328486Z %217 = tt.dot %201, %216, %199, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6328575Z ttg.local_dealloc %164 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6328685Z %218 = arith.truncf %217 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:55:13.6328829Z %219 = tt.expand_dims %148 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6328893Z %220 = arith.muli %219, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6329031Z %221 = tt.expand_dims %151 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:55:13.6329119Z %222 = tt.broadcast %220 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6329201Z %223 = tt.broadcast %221 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6329261Z %224 = arith.addi %222, %223 : tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6329364Z %225 = tt.addptr %20, %224 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6329432Z tt.store %225, %218 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:55:13.6329477Z %226 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:55:13.6329526Z %227 = arith.divsi %226, %c128_i32 : i32 2026-02-21T09:55:13.6329571Z %228 = arith.muli %227, %c4_i32 : i32 2026-02-21T09:55:13.6329616Z 
%229 = arith.subi %c128_i32, %228 : i32 2026-02-21T09:55:13.6329657Z %230 = arith.minsi %229, %c4_i32 : i32 2026-02-21T09:55:13.6329705Z %231 = arith.remsi %226, %c128_i32 : i32 2026-02-21T09:55:13.6329762Z %232 = arith.remsi %231, %230 : i32 2026-02-21T09:55:13.6329802Z %233 = arith.addi %228, %232 : i32 2026-02-21T09:55:13.6329845Z %234 = arith.divsi %231, %230 : i32 2026-02-21T09:55:13.6329890Z %235 = arith.muli %233, %c128_i32 : i32 2026-02-21T09:55:13.6329985Z %236 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6330074Z %237 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6330168Z %238 = arith.addi %236, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6330272Z %239 = arith.addi %237, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6330319Z %240 = arith.muli %234, %c256_i32 : i32 2026-02-21T09:55:13.6330403Z %241 = tt.splat %240 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6330486Z %242 = arith.addi %241, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6330640Z %243 = tt.expand_dims %238 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6330721Z %244 = arith.muli %243, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6330816Z %245 = tt.broadcast %244 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6330859Z %246 = arith.extsi %240 : i32 to i64 2026-02-21T09:55:13.6330952Z %247 = tt.splat %246 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6331043Z %248 = arith.addi %247, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6331191Z %249 = tt.expand_dims %248 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = 
#blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6331289Z %250 = tt.broadcast %249 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6331361Z %251 = arith.cmpi sge, %249, %cst_4 : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6331431Z %252 = arith.cmpi slt, %249, %cst_3 : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6331495Z %253 = arith.andi %251, %252 : tensor<1x256xi1, #blocked> 2026-02-21T09:55:13.6331585Z %254 = tt.broadcast %253 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6331688Z %255 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6331751Z %256 = arith.addi %245, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6331855Z %257 = tt.addptr %8, %256 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6331920Z %258 = tt.load %257 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6331981Z %259 = arith.addi %61, %250 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6332079Z %260 = tt.addptr %9, %259 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6332138Z %261 = arith.andi %67, %254 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6332261Z %262 = tt.load %260, %261, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6332448Z %263 = ttg.memdesc_index %255[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6332592Z ttg.local_store %258, %263 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6332655Z %264 = arith.addi %245, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6332757Z %265 = tt.addptr %8, %264 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6332821Z %266 = tt.load %265 : tensor<128x4x!tt.ptr, #blocked2> 
2026-02-21T09:55:13.6332896Z %267 = arith.addi %80, %250 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6332996Z %268 = tt.addptr %9, %267 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6333056Z %269 = arith.andi %86, %254 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6333174Z %270 = tt.load %268, %269, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6333361Z %271 = ttg.memdesc_index %255[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6333504Z ttg.local_store %266, %271 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6333978Z %272:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %263, %arg8 = %271, %arg9 = %262, %arg10 = %270) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:55:13.6334026Z %317 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.6334086Z %318 = arith.muli %317, %c2_i32 : i32 2026-02-21T09:55:13.6334184Z %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6334276Z %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6334425Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6334520Z %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6334586Z %323 = arith.addi %245, %322 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6334690Z %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6334756Z %325 = tt.load %324 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6334964Z %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6335168Z %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6335227Z %328 = arith.extsi %317 : i32 to i64 2026-02-21T09:55:13.6335321Z %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6335412Z %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6335558Z %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6335625Z %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6335718Z %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6335778Z %334 = arith.addi %333, %250 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6335882Z %335 = tt.addptr %9, %334 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6335951Z %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6336018Z %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6336080Z %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked> 2026-02-21T09:55:13.6336169Z %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6336230Z %340 = arith.andi %339, %254 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6336320Z %341 = tt.load %335, %340, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6336384Z %342 = arith.shli %arg9, %cst : 
tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6336445Z %343 = arith.shrsi %342, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6336596Z %344 = ttg.convert_layout %343 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6336661Z %345 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6336808Z %346 = ttg.convert_layout %345 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6336984Z %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6337137Z %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6337238Z %349 = tt.broadcast %347 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6337367Z %350 = arith.select %17, %349, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6337467Z %351 = tt.broadcast %348 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6337574Z %352 = arith.select %19, %351, %350 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6337670Z %353 = tt.reshape %352 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6337772Z %354 = arith.sitofp %353 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6337894Z %355 = ttg.local_alloc %354 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6338069Z %356 = ttg.local_load %355 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6338351Z %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6338400Z %358 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.6338454Z %359 = arith.cmpi slt, %358, %c2_i32 : i32 2026-02-21T09:55:13.6338522Z %360 = arith.select %359, %358, %c0_i32 : i32 2026-02-21T09:55:13.6338711Z %361 = ttg.memdesc_index %255[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6338857Z ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6339172Z scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6339258Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.6339464Z %273 = ttg.local_load %272#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6339667Z %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6339736Z %275 = arith.shli %272#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6339806Z %276 = arith.shrsi %275, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6339954Z %277 = ttg.convert_layout %276 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6340036Z %278 = arith.shrsi %272#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6340185Z %279 = ttg.convert_layout %278 : tensor<2x256xi8, #blocked> -> 
tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6340342Z %280 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6340496Z %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6340602Z %282 = tt.broadcast %280 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6340734Z %283 = arith.select %17, %282, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6340833Z %284 = tt.broadcast %281 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6340944Z %285 = arith.select %19, %284, %283 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6341055Z %286 = tt.reshape %285 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6341190Z %287 = arith.sitofp %286 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6341382Z %288 = ttg.local_alloc %287 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6341679Z %289 = ttg.local_load %288 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6342171Z %290 = tt.dot %274, %289, %272#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6342545Z %291 = ttg.local_load %272#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6342913Z %292 = arith.extf %291 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, 
kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6343024Z %293 = arith.shli %272#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6343136Z %294 = arith.shrsi %293, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6343422Z %295 = ttg.convert_layout %294 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6343530Z %296 = arith.shrsi %272#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6343792Z %297 = ttg.convert_layout %296 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6344073Z %298 = tt.expand_dims %295 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6344353Z %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6344533Z %300 = tt.broadcast %298 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6344728Z %301 = arith.select %17, %300, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6344904Z %302 = tt.broadcast %299 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6345099Z %303 = arith.select %19, %302, %301 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6345265Z %304 = tt.reshape %303 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6345434Z %305 = arith.sitofp %304 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6345672Z %306 = ttg.local_alloc %305 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6345990Z %307 = ttg.local_load %306 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, 
parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6346490Z %308 = tt.dot %292, %307, %290, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6346651Z ttg.local_dealloc %255 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6346836Z %309 = arith.truncf %308 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:55:13.6347094Z %310 = tt.expand_dims %239 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6347206Z %311 = arith.muli %310, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6347462Z %312 = tt.expand_dims %242 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:55:13.6347633Z %313 = tt.broadcast %311 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6347781Z %314 = tt.broadcast %312 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6347888Z %315 = arith.addi %313, %314 : tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6348064Z %316 = tt.addptr %20, %315 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6348173Z tt.store %316, %309 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:55:13.6348249Z } {tt.num_stages = 1 : i32} 2026-02-21T09:55:13.6348337Z scf.for %arg3 = %24 to %2 step %c1_i32 : i32 { 2026-02-21T09:55:13.6348418Z %25 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:55:13.6348494Z %26 = arith.muli %25, %c4_i32 : i32 2026-02-21T09:55:13.6348567Z %27 = arith.subi %c128_i32, %26 : i32 2026-02-21T09:55:13.6348639Z %28 = arith.minsi %27, %c4_i32 : i32 2026-02-21T09:55:13.6348721Z %29 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:55:13.6348793Z %30 = arith.remsi %29, %28 : i32 2026-02-21T09:55:13.6348862Z %31 = arith.addi %26, %30 : i32 
2026-02-21T09:55:13.6348930Z %32 = arith.divsi %29, %28 : i32 2026-02-21T09:55:13.6349007Z %33 = arith.muli %31, %c128_i32 : i32 2026-02-21T09:55:13.6349192Z %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6349340Z %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6349508Z %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.6349653Z %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.6349725Z %38 = arith.muli %32, %c256_i32 : i32 2026-02-21T09:55:13.6349875Z %39 = tt.splat %38 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6350022Z %40 = arith.addi %39, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.6350296Z %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6350408Z %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.6350577Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6350648Z %44 = arith.extsi %38 : i32 to i64 2026-02-21T09:55:13.6350807Z %45 = tt.splat %44 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6350972Z %46 = arith.addi %45, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> 2026-02-21T09:55:13.6351258Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6351420Z %48 = tt.broadcast %47 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6351544Z %49 = arith.cmpi sge, %47, %cst_4 : tensor<1x256xi64, #blocked> 2026-02-21T09:55:13.6351660Z %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x256xi64, #blocked> 
2026-02-21T09:55:13.6351760Z %51 = arith.andi %49, %50 : tensor<1x256xi1, #blocked> 2026-02-21T09:55:13.6351925Z %52 = tt.broadcast %51 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6352093Z %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6352418Z %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6352621Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6352739Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6352944Z %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6353087Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6353376Z %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6353492Z %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6353657Z %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6353764Z %62 = arith.addi %61, %48 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6353954Z %63 = tt.addptr %9, %62 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6354078Z %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6354200Z %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6354312Z %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked> 2026-02-21T09:55:13.6354477Z %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6354586Z %68 = arith.andi %67, %52 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6354813Z %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, 
#blocked> 2026-02-21T09:55:13.6355218Z %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6355503Z ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6355691Z %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6355980Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6356155Z %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6356270Z %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6356469Z %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6356586Z %76 = tt.load %75 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6356771Z %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6357054Z %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6357167Z %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6357341Z %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6357472Z %81 = arith.addi %80, %48 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6357661Z %82 = tt.addptr %9, %81 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6357785Z %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6357913Z %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6358017Z %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked> 2026-02-21T09:55:13.6358184Z %86 = tt.broadcast 
%85 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6358321Z %87 = arith.andi %86, %52 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6358547Z %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6358928Z %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6359214Z ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6360175Z %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>) : i32 { 2026-02-21T09:55:13.6360261Z %135 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:55:13.6360348Z %136 = arith.muli %135, %c2_i32 : i32 2026-02-21T09:55:13.6360533Z %137 = tt.splat %136 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6360715Z %138 = arith.addi %137, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.6361018Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:55:13.6361208Z %140 = tt.broadcast %139 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6361328Z %141 = arith.addi %43, %140 : tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6361541Z %142 = tt.addptr %8, %141 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:55:13.6361687Z %143 = tt.load %142 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:55:13.6362121Z %144 = ttg.local_load %arg7 
: !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6362615Z %145 = arith.extf %144 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6362698Z %146 = arith.extsi %135 : i32 to i64 2026-02-21T09:55:13.6362894Z %147 = tt.splat %146 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6363101Z %148 = arith.addi %147, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.6363412Z %149 = tt.expand_dims %148 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6363540Z %150 = arith.muli %149, %cst_7 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6363734Z %151 = tt.broadcast %150 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6363860Z %152 = arith.addi %151, %48 : tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6364071Z %153 = tt.addptr %9, %152 : tensor<2x256x!tt.ptr, #blocked>, tensor<2x256xi64, #blocked> 2026-02-21T09:55:13.6364218Z %154 = arith.cmpi sge, %149, %cst_6 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6364396Z %155 = arith.cmpi slt, %149, %cst_5 : tensor<2x1xi64, #blocked> 2026-02-21T09:55:13.6364515Z %156 = arith.andi %154, %155 : tensor<2x1xi1, #blocked> 2026-02-21T09:55:13.6364703Z %157 = tt.broadcast %156 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6364825Z %158 = arith.andi %157, %52 : tensor<2x256xi1, #blocked> 2026-02-21T09:55:13.6364975Z %159 = tt.load %153, %158, %cst_1 : tensor<2x256x!tt.ptr, #blocked> 2026-02-21T09:55:13.6365104Z %160 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6365234Z %161 = arith.shrsi %160, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6365582Z %162 = ttg.convert_layout %161 
: tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6365714Z %163 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6366036Z %164 = ttg.convert_layout %163 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6366398Z %165 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6366732Z %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6366950Z %167 = tt.broadcast %165 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6367185Z %168 = arith.select %17, %167, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6367396Z %169 = tt.broadcast %166 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6367623Z %170 = arith.select %19, %169, %168 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6367821Z %171 = tt.reshape %170 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6368023Z %172 = arith.sitofp %171 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6368286Z %173 = ttg.local_alloc %172 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6368663Z %174 = ttg.local_load %173 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6369288Z %175 = tt.dot %145, %174, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6369385Z %176 = 
arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.6369481Z %177 = arith.cmpi slt, %176, %c2_i32 : i32 2026-02-21T09:55:13.6369580Z %178 = arith.select %177, %176, %c0_i32 : i32 2026-02-21T09:55:13.6369983Z %179 = ttg.memdesc_index %53[%178] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6370295Z ttg.local_store %143, %179 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:55:13.6370988Z scf.yield %175, %178, %arg8, %179, %arg10, %159 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6371160Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.6371587Z %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6372016Z %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6372168Z %93 = arith.shli %90#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6372289Z %94 = arith.shrsi %93, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6372599Z %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6372730Z %96 = arith.shrsi %90#4, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6373035Z %97 = ttg.convert_layout %96 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6373387Z %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> 
tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6373717Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6373926Z %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6374179Z %101 = arith.select %17, %100, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6374390Z %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6374613Z %103 = arith.select %19, %102, %101 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6374811Z %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6375018Z %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6375274Z %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6375650Z %107 = ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6376251Z %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6376682Z %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6377139Z %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6377272Z %111 = arith.shli %90#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6377395Z 
%112 = arith.shrsi %111, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6377713Z %113 = ttg.convert_layout %112 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6377849Z %114 = arith.shrsi %90#5, %cst : tensor<2x256xi8, #blocked> 2026-02-21T09:55:13.6378163Z %115 = ttg.convert_layout %114 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.6378501Z %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6378837Z %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1> 2026-02-21T09:55:13.6379053Z %118 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6379285Z %119 = arith.select %17, %118, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6379503Z %120 = tt.broadcast %117 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6379750Z %121 = arith.select %19, %120, %119 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1> 2026-02-21T09:55:13.6379946Z %122 = tt.reshape %121 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:55:13.6380152Z %123 = arith.sitofp %122 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:55:13.6380408Z %124 = ttg.local_alloc %123 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:55:13.6380777Z %125 = ttg.local_load %124 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.6381398Z %126 = tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:55:13.6381580Z ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.6381768Z %127 = arith.truncf %126 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:55:13.6382097Z %128 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6382221Z %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.6382523Z %130 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:55:13.6382706Z %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6382882Z %132 = tt.broadcast %130 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6383001Z %133 = arith.addi %131, %132 : tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6383214Z %134 = tt.addptr %20, %133 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:55:13.6383341Z tt.store %134, %127 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:55:13.6383422Z } {tt.num_stages = 1 : i32} 2026-02-21T09:55:13.6383496Z tt.return 2026-02-21T09:55:13.6383560Z } 2026-02-21T09:55:13.6383627Z } 2026-02-21T09:55:13.6383634Z 2026-02-21T09:55:13.6383695Z {-# 2026-02-21T09:55:13.6383781Z external_resources: { 2026-02-21T09:55:13.6383860Z mlir_reproducer: { 2026-02-21T09:55:13.6386007Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:55:13.6386099Z disable_threading: false,
2026-02-21T09:55:13.6386173Z verify_each: true
2026-02-21T09:55:13.6386238Z }
2026-02-21T09:55:13.6386305Z }
2026-02-21T09:55:13.6386367Z #-}
2026-02-21T09:55:13.6386900Z /tmp/torchinductor_root/os/coswwui5nvwsubzopnbmqmwkhbwsnfpi7fvgc4rugf6f6skohdge.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:13.6387846Z /tmp/torchinductor_root/os/coswwui5nvwsubzopnbmqmwkhbwsnfpi7fvgc4rugf6f6skohdge.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:13.6388084Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:13.6389490Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:55:13.6389621Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:13.6389795Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:13.7717324Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:55:13.7729725Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:55:13.7729900Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T09:55:13.7730107Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:55:13.7730249Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:55:13.7730376Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:55:13.7730506Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:55:13.7730562Z #smem = #ttg.shared_memory 2026-02-21T09:55:13.7730773Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:55:13.7731124Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:55:13.7731217Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7731313Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.7731402Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.7731493Z %cst_2 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7731619Z %cst_3 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7731682Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:13.7731781Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7731838Z %c500_i32 = arith.constant 500 : i32 2026-02-21T09:55:13.7731895Z %c12_i32 = 
arith.constant 12 : i32 2026-02-21T09:55:13.7732032Z %cst_5 = arith.constant dense<500> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7732168Z %cst_6 = arith.constant dense<504> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7732304Z %cst_7 = arith.constant dense<508> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7732436Z %cst_8 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7732573Z %cst_9 = arith.constant dense<16> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7732630Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:55:13.7732682Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:55:13.7732735Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:55:13.7732831Z %cst_10 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7732888Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:55:13.7732939Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:55:13.7732997Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:55:13.7733073Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:55:13.7733208Z %cst_11 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7733263Z %0 = tt.get_program_id x : i32 2026-02-21T09:55:13.7733322Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:55:13.7733374Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:55:13.7733529Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7733673Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7733833Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7733977Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7734119Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7734254Z %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7734376Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7734475Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7734672Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:55:13.7734943Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:55:13.7735127Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.7735214Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.7735326Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:13.7735414Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.7735521Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:13.7735619Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.7735691Z %19 = arith.subi %2, %0 : i32 2026-02-21T09:55:13.7735743Z %20 = arith.remsi %19, %c3_i32 : i32 2026-02-21T09:55:13.7735794Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:55:13.7735845Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:55:13.7735919Z scf.for %arg3 = %0 to %22 step %c3_i32 : i32 { 2026-02-21T09:55:13.7735977Z %23 = arith.divsi %arg3, %c256_i32 : i32 
2026-02-21T09:55:13.7736027Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:55:13.7736082Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:55:13.7736132Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:55:13.7736187Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.7736237Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:55:13.7736286Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:55:13.7736334Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:55:13.7736383Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:55:13.7736500Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7736599Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7736716Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7736816Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7736865Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:55:13.7736980Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7737090Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7737188Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7737294Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7737477Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7737555Z %42 = arith.muli %41, %cst_3 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7737685Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7737866Z %44 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> 
tensor<1x128xi32, #blocked1> 2026-02-21T09:55:13.7737979Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7738110Z %46 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.7738287Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7738394Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7738468Z %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7738595Z %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7738671Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7738899Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7739073Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7739187Z %53 = arith.addi %8, %cst_8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7739357Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7739485Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7739557Z %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7739681Z %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7739755Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7739986Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 
2026-02-21T09:55:13.7740157Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7740271Z %60 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7740441Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7740549Z %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7740622Z %63 = arith.addi %43, %62 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7740742Z %64 = tt.addptr %9, %63 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7740814Z %65 = tt.load %64 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7741033Z %66 = ttg.memdesc_index %46[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7741237Z ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7741767Z %67:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = %c2_i32, %arg7 = %52, %arg8 = %59, %arg9 = %66) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>) : i32 { 2026-02-21T09:55:13.7741891Z %360 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7742023Z %361 = arith.addi %360, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7742080Z %362 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:55:13.7742138Z %363 = arith.muli %362, %c2_i32 : i32 2026-02-21T09:55:13.7742250Z %364 = tt.splat %363 : i32 -> tensor<8xi32, 
#ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7742374Z %365 = arith.addi %364, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7742553Z %366 = tt.expand_dims %365 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7742666Z %367 = tt.broadcast %366 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7742743Z %368 = arith.addi %43, %367 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7742874Z %369 = tt.addptr %9, %368 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7742953Z %370 = tt.load %369 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7743202Z %371 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7743452Z %372 = arith.extf %371 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7743623Z %373 = tt.expand_dims %361 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7743700Z %374 = arith.muli %373, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7743839Z %375 = tt.broadcast %374 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7743916Z %376 = arith.addi %375, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7744037Z %377 = tt.addptr %10, %376 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7744112Z %378 = tt.load %377 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7744291Z %379 = ttg.convert_layout %378 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7744410Z %380 = arith.shli %379, %cst_11 : 
tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7744582Z %381 = arith.shrsi %380, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7744772Z %382 = arith.shrsi %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7745017Z %383 = tt.expand_dims %381 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7745198Z %384 = tt.expand_dims %382 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7745314Z %385 = tt.broadcast %383 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7745464Z %386 = arith.select %15, %385, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7745578Z %387 = tt.broadcast %384 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7745698Z %388 = arith.select %17, %387, %386 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7745805Z %389 = tt.reshape %388 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7745921Z %390 = arith.sitofp %389 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7746065Z %391 = ttg.local_alloc %390 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7746288Z %392 = ttg.local_load %391 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7746616Z %393 = tt.dot %372, %392, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7746689Z %394 = arith.addi %arg6, %c1_i32 : i32 
2026-02-21T09:55:13.7746747Z %395 = arith.cmpi slt, %394, %c3_i32 : i32 2026-02-21T09:55:13.7746809Z %396 = arith.select %395, %394, %c0_i32 : i32 2026-02-21T09:55:13.7747025Z %397 = ttg.memdesc_index %46[%396] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7747196Z ttg.local_store %370, %397 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7747565Z scf.yield %393, %396, %arg8, %arg9, %397 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7747665Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T09:55:13.7747775Z %68 = arith.addi %7, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7748010Z %69 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7748259Z %70 = arith.extf %69 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7748429Z %71 = tt.expand_dims %68 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7748507Z %72 = arith.muli %71, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7748612Z %73 = tt.broadcast %72 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7748682Z %74 = arith.addi %73, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7748801Z %75 = tt.addptr %10, %74 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7748871Z %76 = tt.load %75 : 
tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7749140Z %77 = ttg.convert_layout %76 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7749277Z %78 = arith.shli %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7749394Z %79 = arith.shrsi %78, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7749508Z %80 = arith.shrsi %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7749690Z %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7749940Z %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7750114Z %83 = tt.broadcast %81 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7750270Z %84 = arith.select %15, %83, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7750378Z %85 = tt.broadcast %82 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7750491Z %86 = arith.select %17, %85, %84 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7750613Z %87 = tt.reshape %86 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7750720Z %88 = arith.sitofp %87 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7750858Z %89 = ttg.local_alloc %88 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7751061Z %90 = ttg.local_load %89 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7751390Z %91 = tt.dot %70, %90, %67#0, inputPrecision = tf32 : tensor<128x8xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7751500Z %92 = arith.addi %7, %cst_6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7751732Z %93 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7751963Z %94 = arith.extf %93 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7752130Z %95 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7752206Z %96 = arith.muli %95, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7752310Z %97 = tt.broadcast %96 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7752377Z %98 = arith.addi %97, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7752513Z %99 = tt.addptr %10, %98 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7752585Z %100 = tt.load %99 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7752761Z %101 = ttg.convert_layout %100 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7752883Z %102 = arith.shli %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7753001Z %103 = arith.shrsi %102, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7753115Z %104 = arith.shrsi %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7753303Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 
2026-02-21T09:55:13.7753476Z %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7753591Z %107 = tt.broadcast %105 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7753722Z %108 = arith.select %15, %107, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7753832Z %109 = tt.broadcast %106 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7753948Z %110 = arith.select %17, %109, %108 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7754075Z %111 = tt.reshape %110 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7754186Z %112 = arith.sitofp %111 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7754326Z %113 = ttg.local_alloc %112 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7754531Z %114 = ttg.local_load %113 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7754839Z %115 = tt.dot %94, %114, %91, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7754965Z %116 = arith.addi %7, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7755200Z %117 = ttg.local_load %67#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7755451Z %118 = arith.extf %117 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7755621Z %119 = 
tt.expand_dims %116 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7755701Z %120 = arith.muli %119, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7755808Z %121 = tt.broadcast %120 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7755881Z %122 = arith.addi %121, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7756004Z %123 = tt.addptr %10, %122 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7756075Z %124 = tt.load %123 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7756251Z %125 = ttg.convert_layout %124 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7756373Z %126 = arith.shli %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7756490Z %127 = arith.shrsi %126, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7756623Z %128 = arith.shrsi %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7756803Z %129 = tt.expand_dims %127 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7756977Z %130 = tt.expand_dims %128 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7757089Z %131 = tt.broadcast %129 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7757217Z %132 = arith.select %15, %131, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7757327Z %133 = tt.broadcast %130 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7757444Z %134 = arith.select %17, %133, %132 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7757553Z %135 = 
tt.reshape %134 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7757669Z %136 = arith.sitofp %135 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7757809Z %137 = ttg.local_alloc %136 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7758015Z %138 = ttg.local_load %137 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7758340Z %139 = tt.dot %118, %138, %115, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7758446Z ttg.local_dealloc %46 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.7758553Z %140 = arith.truncf %139 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.7758718Z %141 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7758790Z %142 = arith.muli %141, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7758971Z %143 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.7759069Z %144 = tt.broadcast %142 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7759165Z %145 = tt.broadcast %143 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7759239Z %146 = arith.addi %144, %145 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7759377Z %147 = tt.addptr %18, %146 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7759451Z tt.store %147, %140 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.7759507Z %148 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:55:13.7759561Z %149 = arith.divsi %148, 
%c256_i32 : i32 2026-02-21T09:55:13.7759611Z %150 = arith.muli %149, %c4_i32 : i32 2026-02-21T09:55:13.7759667Z %151 = arith.subi %c128_i32, %150 : i32 2026-02-21T09:55:13.7759719Z %152 = arith.minsi %151, %c4_i32 : i32 2026-02-21T09:55:13.7759769Z %153 = arith.remsi %148, %c256_i32 : i32 2026-02-21T09:55:13.7759820Z %154 = arith.remsi %153, %152 : i32 2026-02-21T09:55:13.7759867Z %155 = arith.addi %150, %154 : i32 2026-02-21T09:55:13.7759913Z %156 = arith.divsi %153, %152 : i32 2026-02-21T09:55:13.7759965Z %157 = arith.muli %155, %c128_i32 : i32 2026-02-21T09:55:13.7760080Z %158 = tt.splat %157 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7760181Z %159 = tt.splat %157 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7760290Z %160 = arith.addi %158, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7760393Z %161 = arith.addi %159, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7760459Z %162 = arith.muli %156, %c128_i32 : i32 2026-02-21T09:55:13.7760559Z %163 = tt.splat %162 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7760672Z %164 = tt.splat %162 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7760771Z %165 = arith.addi %163, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7760878Z %166 = arith.addi %164, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7761059Z %167 = tt.expand_dims %160 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7761137Z %168 = arith.muli %167, %cst_3 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7761250Z %169 = tt.broadcast %168 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7761427Z %170 = tt.expand_dims %166 {axis = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:55:13.7761539Z %171 = tt.broadcast %170 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7761642Z %172 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.7761717Z %173 = arith.addi %169, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7761855Z %174 = tt.addptr %9, %173 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7761930Z %175 = tt.load %174 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7762156Z %176 = ttg.memdesc_index %172[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7762327Z ttg.local_store %175, %176 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7762398Z %177 = arith.addi %169, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7762519Z %178 = tt.addptr %9, %177 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7762670Z %179 = tt.load %178 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7762885Z %180 = ttg.memdesc_index %172[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7763052Z ttg.local_store %179, %180 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7763145Z %181 = arith.addi %169, %62 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7763265Z %182 = tt.addptr %9, %181 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7763338Z %183 = tt.load %182 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7763563Z %184 = ttg.memdesc_index %172[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> 
!ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7763731Z ttg.local_store %183, %184 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7764256Z %185:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = %c2_i32, %arg7 = %176, %arg8 = %180, %arg9 = %184) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>) : i32 { 2026-02-21T09:55:13.7764377Z %360 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7764486Z %361 = arith.addi %360, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7764546Z %362 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:55:13.7764619Z %363 = arith.muli %362, %c2_i32 : i32 2026-02-21T09:55:13.7764729Z %364 = tt.splat %363 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7764839Z %365 = arith.addi %364, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7765011Z %366 = tt.expand_dims %365 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7765121Z %367 = tt.broadcast %366 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7765200Z %368 = arith.addi %169, %367 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7765322Z %369 = tt.addptr %9, %368 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7765399Z %370 = tt.load %369 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7765640Z %371 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7765879Z 
%372 = arith.extf %371 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7766048Z %373 = tt.expand_dims %361 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7766145Z %374 = arith.muli %373, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7766255Z %375 = tt.broadcast %374 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7766328Z %376 = arith.addi %375, %171 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7766452Z %377 = tt.addptr %10, %376 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7766527Z %378 = tt.load %377 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7766700Z %379 = ttg.convert_layout %378 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7766839Z %380 = arith.shli %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7766957Z %381 = arith.shrsi %380, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7767076Z %382 = arith.shrsi %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7767272Z %383 = tt.expand_dims %381 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7767447Z %384 = tt.expand_dims %382 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7767562Z %385 = tt.broadcast %383 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7767692Z %386 = arith.select %15, %385, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7767805Z %387 = tt.broadcast %384 : 
tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7767924Z %388 = arith.select %17, %387, %386 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7768036Z %389 = tt.reshape %388 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7768147Z %390 = arith.sitofp %389 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7768290Z %391 = ttg.local_alloc %390 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7768492Z %392 = ttg.local_load %391 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7768829Z %393 = tt.dot %372, %392, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7768886Z %394 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.7768945Z %395 = arith.cmpi slt, %394, %c3_i32 : i32 2026-02-21T09:55:13.7769004Z %396 = arith.select %395, %394, %c0_i32 : i32 2026-02-21T09:55:13.7769221Z %397 = ttg.memdesc_index %172[%396] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7769392Z ttg.local_store %370, %397 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7769759Z scf.yield %393, %396, %arg8, %arg9, %397 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7769855Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T09:55:13.7770093Z %186 = ttg.local_load %185#2 : !ttg.memdesc<128x8xbf16, 
#shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7770328Z %187 = arith.extf %186 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7770415Z %188 = arith.addi %73, %171 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7770540Z %189 = tt.addptr %10, %188 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7770611Z %190 = tt.load %189 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7770780Z %191 = ttg.convert_layout %190 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7770910Z %192 = arith.shli %191, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7771043Z %193 = arith.shrsi %192, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7771157Z %194 = arith.shrsi %191, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7771340Z %195 = tt.expand_dims %193 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7771529Z %196 = tt.expand_dims %194 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7771643Z %197 = tt.broadcast %195 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7771769Z %198 = arith.select %15, %197, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7771878Z %199 = tt.broadcast %196 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7771996Z %200 = arith.select %17, %199, %198 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7772106Z %201 = tt.reshape %200 : 
tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7772215Z %202 = arith.sitofp %201 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7772356Z %203 = ttg.local_alloc %202 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7772561Z %204 = ttg.local_load %203 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7772891Z %205 = tt.dot %187, %204, %185#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7773125Z %206 = ttg.local_load %185#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7773356Z %207 = arith.extf %206 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7773427Z %208 = arith.addi %97, %171 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7773550Z %209 = tt.addptr %10, %208 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7773621Z %210 = tt.load %209 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7773790Z %211 = ttg.convert_layout %210 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7773908Z %212 = arith.shli %211, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7774026Z %213 = arith.shrsi %212, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7774140Z %214 = arith.shrsi %211, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7774317Z %215 = 
tt.expand_dims %213 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7774507Z %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7774620Z %217 = tt.broadcast %215 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7774741Z %218 = arith.select %15, %217, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7774854Z %219 = tt.broadcast %216 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7774969Z %220 = arith.select %17, %219, %218 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7775098Z %221 = tt.reshape %220 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7775208Z %222 = arith.sitofp %221 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7775346Z %223 = ttg.local_alloc %222 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7775562Z %224 = ttg.local_load %223 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7775870Z %225 = tt.dot %207, %224, %205, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7776099Z %226 = ttg.local_load %185#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7776333Z %227 = arith.extf %226 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:55:13.7776407Z %228 = arith.addi %121, %171 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7776528Z %229 = tt.addptr %10, %228 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7776599Z %230 = tt.load %229 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7776771Z %231 = ttg.convert_layout %230 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7776886Z %232 = arith.shli %231, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7777018Z %233 = arith.shrsi %232, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7777134Z %234 = arith.shrsi %231, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7777310Z %235 = tt.expand_dims %233 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7777481Z %236 = tt.expand_dims %234 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7777598Z %237 = tt.broadcast %235 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7777720Z %238 = arith.select %15, %237, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7777829Z %239 = tt.broadcast %236 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7777950Z %240 = arith.select %17, %239, %238 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7778055Z %241 = tt.reshape %240 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7778165Z %242 = arith.sitofp %241 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7778305Z %243 = ttg.local_alloc %242 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 
2026-02-21T09:55:13.7778517Z %244 = ttg.local_load %243 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7778822Z %245 = tt.dot %227, %244, %225, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7778928Z ttg.local_dealloc %172 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.7779033Z %246 = arith.truncf %245 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.7779196Z %247 = tt.expand_dims %161 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7779285Z %248 = arith.muli %247, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7779445Z %249 = tt.expand_dims %165 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.7779543Z %250 = tt.broadcast %248 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7779641Z %251 = tt.broadcast %249 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7779725Z %252 = arith.addi %250, %251 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7779840Z %253 = tt.addptr %18, %252 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7779917Z tt.store %253, %246 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.7779970Z %254 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:55:13.7780022Z %255 = arith.divsi %254, %c256_i32 : i32 2026-02-21T09:55:13.7780075Z %256 = arith.muli %255, %c4_i32 : i32 2026-02-21T09:55:13.7780126Z %257 = arith.subi %c128_i32, %256 : i32 2026-02-21T09:55:13.7780175Z %258 = arith.minsi %257, %c4_i32 : i32 2026-02-21T09:55:13.7780238Z %259 = arith.remsi %254, %c256_i32 : i32 2026-02-21T09:55:13.7780287Z %260 = 
arith.remsi %259, %258 : i32 2026-02-21T09:55:13.7780334Z %261 = arith.addi %256, %260 : i32 2026-02-21T09:55:13.7780384Z %262 = arith.divsi %259, %258 : i32 2026-02-21T09:55:13.7780435Z %263 = arith.muli %261, %c128_i32 : i32 2026-02-21T09:55:13.7780546Z %264 = tt.splat %263 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7780649Z %265 = tt.splat %263 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7780772Z %266 = arith.addi %264, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7780868Z %267 = arith.addi %265, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7780923Z %268 = arith.muli %262, %c128_i32 : i32 2026-02-21T09:55:13.7781022Z %269 = tt.splat %268 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7781128Z %270 = tt.splat %268 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7781229Z %271 = arith.addi %269, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7781334Z %272 = arith.addi %270, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7781509Z %273 = tt.expand_dims %266 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7781588Z %274 = arith.muli %273, %cst_3 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7781701Z %275 = tt.broadcast %274 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7781873Z %276 = tt.expand_dims %272 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:55:13.7781982Z %277 = tt.broadcast %276 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7782101Z %278 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 
2026-02-21T09:55:13.7782170Z %279 = arith.addi %275, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7782290Z %280 = tt.addptr %9, %279 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7782368Z %281 = tt.load %280 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7782588Z %282 = ttg.memdesc_index %278[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7782754Z ttg.local_store %281, %282 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7782842Z %283 = arith.addi %275, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7782960Z %284 = tt.addptr %9, %283 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7783034Z %285 = tt.load %284 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7783251Z %286 = ttg.memdesc_index %278[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7783432Z ttg.local_store %285, %286 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7783502Z %287 = arith.addi %275, %62 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7783619Z %288 = tt.addptr %9, %287 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7783688Z %289 = tt.load %288 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7783888Z %290 = ttg.memdesc_index %278[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7784047Z ttg.local_store %289, %290 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7784535Z %291:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = 
%c2_i32, %arg7 = %282, %arg8 = %286, %arg9 = %290) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>) : i32 { 2026-02-21T09:55:13.7784646Z %360 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7784767Z %361 = arith.addi %360, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7784822Z %362 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:55:13.7784870Z %363 = arith.muli %362, %c2_i32 : i32 2026-02-21T09:55:13.7784976Z %364 = tt.splat %363 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7785076Z %365 = arith.addi %364, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7785238Z %366 = tt.expand_dims %365 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7785348Z %367 = tt.broadcast %366 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7785418Z %368 = arith.addi %275, %367 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7785534Z %369 = tt.addptr %9, %368 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7785611Z %370 = tt.load %369 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7785835Z %371 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7786058Z %372 = arith.extf %371 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7786237Z %373 = tt.expand_dims %361 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 
2026-02-21T09:55:13.7786308Z %374 = arith.muli %373, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7786410Z %375 = tt.broadcast %374 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7786482Z %376 = arith.addi %375, %277 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7786596Z %377 = tt.addptr %10, %376 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7786680Z %378 = tt.load %377 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7786848Z %379 = ttg.convert_layout %378 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7786958Z %380 = arith.shli %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7787069Z %381 = arith.shrsi %380, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7787196Z %382 = arith.shrsi %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7787363Z %383 = tt.expand_dims %381 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7787528Z %384 = tt.expand_dims %382 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7787639Z %385 = tt.broadcast %383 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7787758Z %386 = arith.select %15, %385, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7787863Z %387 = tt.broadcast %384 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7787979Z %388 = arith.select %17, %387, %386 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7788080Z %389 = tt.reshape %388 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7788184Z %390 = 
arith.sitofp %389 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7788319Z %391 = ttg.local_alloc %390 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7788525Z %392 = ttg.local_load %391 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7788822Z %393 = tt.dot %372, %392, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7788879Z %394 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.7788933Z %395 = arith.cmpi slt, %394, %c3_i32 : i32 2026-02-21T09:55:13.7788988Z %396 = arith.select %395, %394, %c0_i32 : i32 2026-02-21T09:55:13.7789197Z %397 = ttg.memdesc_index %278[%396] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7789356Z ttg.local_store %370, %397 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7789695Z scf.yield %393, %396, %arg8, %arg9, %397 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7798446Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T09:55:13.7798718Z %292 = ttg.local_load %291#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7798972Z %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7799036Z 
%294 = arith.addi %73, %277 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7799141Z %295 = tt.addptr %10, %294 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7799204Z %296 = tt.load %295 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7799353Z %297 = ttg.convert_layout %296 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7799473Z %298 = arith.shli %297, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7799573Z %299 = arith.shrsi %298, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7799673Z %300 = arith.shrsi %297, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7799844Z %301 = tt.expand_dims %299 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7799993Z %302 = tt.expand_dims %300 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7800093Z %303 = tt.broadcast %301 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7800199Z %304 = arith.select %15, %303, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7800294Z %305 = tt.broadcast %302 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7800396Z %306 = arith.select %17, %305, %304 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7800506Z %307 = tt.reshape %306 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7800598Z %308 = arith.sitofp %307 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7800719Z %309 = ttg.local_alloc %308 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7800894Z 
%310 = ttg.local_load %309 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7801177Z %311 = tt.dot %293, %310, %291#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7801377Z %312 = ttg.local_load %291#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7801579Z %313 = arith.extf %312 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7801642Z %314 = arith.addi %97, %277 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7801746Z %315 = tt.addptr %10, %314 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7801808Z %316 = tt.load %315 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7801953Z %317 = ttg.convert_layout %316 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7802053Z %318 = arith.shli %317, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7802154Z %319 = arith.shrsi %318, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7802252Z %320 = arith.shrsi %317, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7802429Z %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7802658Z %322 = tt.expand_dims %320 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7802757Z %323 = tt.broadcast %321 : 
tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7802862Z %324 = arith.select %15, %323, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7802960Z %325 = tt.broadcast %322 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7803083Z %326 = arith.select %17, %325, %324 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7803173Z %327 = tt.reshape %326 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7803270Z %328 = arith.sitofp %327 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7803406Z %329 = ttg.local_alloc %328 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7803573Z %330 = ttg.local_load %329 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7803839Z %331 = tt.dot %313, %330, %311, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7804035Z %332 = ttg.local_load %291#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7804229Z %333 = arith.extf %332 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7804295Z %334 = arith.addi %121, %277 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7804395Z %335 = tt.addptr %10, %334 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7804454Z %336 = tt.load %335 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7804601Z %337 = ttg.convert_layout %336 : tensor<4x128xi8, 
#blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7804716Z %338 = arith.shli %337, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7804816Z %339 = arith.shrsi %338, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7804915Z %340 = arith.shrsi %337, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7805063Z %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7805212Z %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7805310Z %343 = tt.broadcast %341 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7805412Z %344 = arith.select %15, %343, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7805506Z %345 = tt.broadcast %342 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7805606Z %346 = arith.select %17, %345, %344 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7805696Z %347 = tt.reshape %346 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7805789Z %348 = arith.sitofp %347 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7805924Z %349 = ttg.local_alloc %348 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7806091Z %350 = ttg.local_load %349 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7807549Z %351 = tt.dot %333, %350, %331, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, 
parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7807650Z ttg.local_dealloc %278 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.7807757Z %352 = arith.truncf %351 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.7807898Z %353 = tt.expand_dims %267 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7807958Z %354 = arith.muli %353, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7808094Z %355 = tt.expand_dims %271 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.7808192Z %356 = tt.broadcast %354 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7808276Z %357 = tt.broadcast %355 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7808337Z %358 = arith.addi %356, %357 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7808434Z %359 = tt.addptr %18, %358 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7808500Z tt.store %359, %352 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.7808542Z } {tt.num_stages = 1 : i32} 2026-02-21T09:55:13.7808594Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:55:13.7808641Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.7808684Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:55:13.7808727Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:55:13.7808769Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:55:13.7808816Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.7808857Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:55:13.7808897Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:55:13.7808937Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:55:13.7808979Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:55:13.7809086Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = 
#blocked2}>> 2026-02-21T09:55:13.7809169Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7809263Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.7809342Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.7809383Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:55:13.7809467Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7809554Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7809634Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.7809722Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.7809870Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7809934Z %42 = arith.muli %41, %cst_3 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:13.7810027Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7810170Z %44 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:55:13.7810272Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7810360Z %46 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.7810499Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7810586Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7810648Z %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2> 
2026-02-21T09:55:13.7810747Z %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7810822Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7811006Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7811146Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7811257Z %53 = arith.addi %8, %cst_8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7811400Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7811488Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7811545Z %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7811647Z %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7811705Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7811884Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7812025Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7812115Z %60 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7812253Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7812356Z %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7812413Z %63 = arith.addi %43, %62 : tensor<128x8xi32, 
#blocked2> 2026-02-21T09:55:13.7812511Z %64 = tt.addptr %9, %63 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7812568Z %65 = tt.load %64 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7812749Z %66 = ttg.memdesc_index %46[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7812885Z ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7813318Z %67:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = %c2_i32, %arg7 = %52, %arg8 = %59, %arg9 = %66) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>) : i32 { 2026-02-21T09:55:13.7813419Z %148 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7813511Z %149 = arith.addi %148, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7813560Z %150 = arith.addi %arg4, %c12_i32 : i32 2026-02-21T09:55:13.7813618Z %151 = arith.muli %150, %c2_i32 : i32 2026-02-21T09:55:13.7813709Z %152 = tt.splat %151 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7813798Z %153 = arith.addi %152, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.7813943Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:13.7814036Z %155 = tt.broadcast %154 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7814097Z %156 = arith.addi %43, %155 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7814216Z %157 = tt.addptr %9, %156 : tensor<128x8x!tt.ptr, #blocked2>, 
tensor<128x8xi32, #blocked2> 2026-02-21T09:55:13.7814280Z %158 = tt.load %157 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:13.7814479Z %159 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7814694Z %160 = arith.extf %159 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7814837Z %161 = tt.expand_dims %149 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7814903Z %162 = arith.muli %161, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7814998Z %163 = tt.broadcast %162 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7815062Z %164 = arith.addi %163, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7815163Z %165 = tt.addptr %10, %164 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7815226Z %166 = tt.load %165 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7815374Z %167 = ttg.convert_layout %166 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7815473Z %168 = arith.shli %167, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7815574Z %169 = arith.shrsi %168, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7815671Z %170 = arith.shrsi %167, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7815834Z %171 = tt.expand_dims %169 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7815985Z %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7816081Z %173 = tt.broadcast %171 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7816188Z %174 = arith.select %15, %173, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7816285Z %175 = tt.broadcast %172 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7816386Z %176 = arith.select %17, %175, %174 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7816477Z %177 = tt.reshape %176 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7816573Z %178 = arith.sitofp %177 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7816690Z %179 = ttg.local_alloc %178 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7816862Z %180 = ttg.local_load %179 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7817132Z %181 = tt.dot %160, %180, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7817193Z %182 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.7817240Z %183 = arith.cmpi slt, %182, %c3_i32 : i32 2026-02-21T09:55:13.7817291Z %184 = arith.select %183, %182, %c0_i32 : i32 2026-02-21T09:55:13.7817471Z %185 = ttg.memdesc_index %46[%184] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7817612Z ttg.local_store %158, %185 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7817934Z scf.yield %181, %184, %arg8, %arg9, %185 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, 
#smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> 2026-02-21T09:55:13.7818017Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T09:55:13.7818123Z %68 = arith.addi %7, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7818319Z %69 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7818513Z %70 = arith.extf %69 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7818654Z %71 = tt.expand_dims %68 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7818717Z %72 = arith.muli %71, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7818805Z %73 = tt.broadcast %72 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7818863Z %74 = arith.addi %73, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7818962Z %75 = tt.addptr %10, %74 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7819020Z %76 = tt.load %75 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7819174Z %77 = ttg.convert_layout %76 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7819281Z %78 = arith.shli %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7819377Z %79 = arith.shrsi %78, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7819474Z %80 = arith.shrsi %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7819620Z %81 = tt.expand_dims %79 {axis = 1 : i32} : 
tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7819767Z %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7819862Z %83 = tt.broadcast %81 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7819961Z %84 = arith.select %15, %83, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7820051Z %85 = tt.broadcast %82 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7820146Z %86 = arith.select %17, %85, %84 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7820232Z %87 = tt.reshape %86 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7820320Z %88 = arith.sitofp %87 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7820435Z %89 = ttg.local_alloc %88 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7820618Z %90 = ttg.local_load %89 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7820878Z %91 = tt.dot %70, %90, %67#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7820971Z %92 = arith.addi %7, %cst_6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7821165Z %93 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7821371Z %94 = arith.extf %93 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, 
parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7821514Z %95 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7821590Z %96 = arith.muli %95, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7821678Z %97 = tt.broadcast %96 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7821735Z %98 = arith.addi %97, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7821834Z %99 = tt.addptr %10, %98 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7821895Z %100 = tt.load %99 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:13.7822043Z %101 = ttg.convert_layout %100 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7822142Z %102 = arith.shli %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7822242Z %103 = arith.shrsi %102, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7822339Z %104 = arith.shrsi %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7822488Z %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7822634Z %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7822745Z %107 = tt.broadcast %105 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7822850Z %108 = arith.select %15, %107, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7822944Z %109 = tt.broadcast %106 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7823045Z %110 = arith.select %17, %109, %108 : tensor<4x2x128xi1, #blocked>, 
tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7823135Z %111 = tt.reshape %110 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7823227Z %112 = arith.sitofp %111 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7823345Z %113 = ttg.local_alloc %112 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7823514Z %114 = ttg.local_load %113 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7823769Z %115 = tt.dot %94, %114, %91, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7823864Z %116 = arith.addi %7, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.7824071Z %117 = ttg.local_load %67#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7824266Z %118 = arith.extf %117 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7824409Z %119 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7824474Z %120 = arith.muli %119, %cst_2 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:13.7824564Z %121 = tt.broadcast %120 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7824648Z %122 = arith.addi %121, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7824748Z %123 = tt.addptr %10, %122 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:13.7824808Z %124 = tt.load %123 : tensor<4x128x!tt.ptr, #blocked1> 
2026-02-21T09:55:13.7824957Z %125 = ttg.convert_layout %124 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7825070Z %126 = arith.shli %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7825168Z %127 = arith.shrsi %126, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7825265Z %128 = arith.shrsi %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.7825416Z %129 = tt.expand_dims %127 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7825562Z %130 = tt.expand_dims %128 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.7825659Z %131 = tt.broadcast %129 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7825765Z %132 = arith.select %15, %131, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7825858Z %133 = tt.broadcast %130 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7825955Z %134 = arith.select %17, %133, %132 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.7826046Z %135 = tt.reshape %134 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1> 2026-02-21T09:55:13.7826151Z %136 = arith.sitofp %135 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1> 2026-02-21T09:55:13.7826269Z %137 = ttg.local_alloc %136 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.7826438Z %138 = ttg.local_load %137 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.7826699Z %139 = tt.dot %118, %138, %115, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx 
= 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.7826787Z ttg.local_dealloc %46 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.7826877Z %140 = arith.truncf %139 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.7827015Z %141 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7827073Z %142 = arith.muli %141, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.7827210Z %143 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.7827294Z %144 = tt.broadcast %142 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7827390Z %145 = tt.broadcast %143 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7827449Z %146 = arith.addi %144, %145 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7827546Z %147 = tt.addptr %18, %146 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.7827608Z tt.store %147, %140 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.7827650Z } {tt.num_stages = 1 : i32} 2026-02-21T09:55:13.7827683Z tt.return 2026-02-21T09:55:13.7827715Z } 2026-02-21T09:55:13.7827747Z } 2026-02-21T09:55:13.7827753Z 2026-02-21T09:55:13.7827783Z {-# 2026-02-21T09:55:13.7827822Z external_resources: { 2026-02-21T09:55:13.7827874Z mlir_reproducer: { 2026-02-21T09:55:13.7828820Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, 
convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:55:13.7828863Z disable_threading: false, 2026-02-21T09:55:13.7828901Z verify_each: true 2026-02-21T09:55:13.7828932Z } 2026-02-21T09:55:13.7828961Z } 2026-02-21T09:55:13.7828992Z #-} 2026-02-21T09:55:13.7829232Z /tmp/torchinductor_root/da/cdarxfmep5fjd6wacwiqv2hhflmegucy3f3xcz54fqqe7jhe47c5.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:55:13.7829648Z /tmp/torchinductor_root/da/cdarxfmep5fjd6wacwiqv2hhflmegucy3f3xcz54fqqe7jhe47c5.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:55:13.7829760Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:55:13.7830412Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:55:13.7830468Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:55:13.7830551Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:55:13.9087284Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:55:13.9098857Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:55:13.9099224Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:55:13.9099525Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:55:13.9099815Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:55:13.9100092Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:55:13.9100340Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:55:13.9100572Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:55:13.9100820Z #smem = #ttg.shared_memory 2026-02-21T09:55:13.9101050Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:55:13.9101524Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:55:13.9101902Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9102077Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.9102273Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.9102454Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9102621Z %c127_i32 = arith.constant 127 : i32 2026-02-21T09:55:13.9102739Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:55:13.9102857Z %c504_i32 = arith.constant 504 : i32 
2026-02-21T09:55:13.9103035Z %cst_3 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9103297Z %cst_4 = arith.constant dense<4> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9103506Z %cst_5 = arith.constant dense<8192> : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9103677Z %cst_6 = arith.constant dense<0> : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9103847Z %cst_7 = arith.constant dense<512> : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9104019Z %cst_8 = arith.constant dense<0> : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9104196Z %cst_9 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9104372Z %cst_10 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9104519Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:13.9104628Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:55:13.9104741Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:55:13.9104855Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:55:13.9104970Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:55:13.9105109Z %cst_11 = arith.constant dense<0> : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9105278Z %cst_12 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9105422Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:55:13.9105531Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:55:13.9105666Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:55:13.9105801Z %cst_13 = arith.constant dense<4> : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9106009Z %cst_14 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9106193Z %0 = tt.get_program_id x : i32 2026-02-21T09:55:13.9106304Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:55:13.9106418Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:55:13.9106621Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9106899Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9107166Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9107432Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9107691Z %7 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9107928Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9108128Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9108355Z %10 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9108684Z %11 = arith.extsi %10 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9109044Z %12 = arith.extsi %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9109395Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:55:13.9109798Z %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:55:13.9110210Z %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.9110456Z %16 = arith.cmpi eq, %15, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.9110648Z %17 = tt.broadcast %16 : 
tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:13.9110856Z %18 = arith.cmpi eq, %15, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:13.9111040Z %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:13.9111244Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.9111398Z %21 = arith.subi %2, %0 : i32 2026-02-21T09:55:13.9111508Z %22 = arith.remsi %21, %c3_i32 : i32 2026-02-21T09:55:13.9111619Z %23 = arith.subi %21, %22 : i32 2026-02-21T09:55:13.9111729Z %24 = arith.addi %0, %23 : i32 2026-02-21T09:55:13.9111848Z scf.for %arg3 = %0 to %24 step %c3_i32 : i32 { 2026-02-21T09:55:13.9111984Z %147 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.9112107Z %148 = arith.muli %147, %c4_i32 : i32 2026-02-21T09:55:13.9112224Z %149 = arith.subi %c128_i32, %148 : i32 2026-02-21T09:55:13.9112342Z %150 = arith.minsi %149, %c4_i32 : i32 2026-02-21T09:55:13.9112459Z %151 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:55:13.9112577Z %152 = arith.remsi %151, %150 : i32 2026-02-21T09:55:13.9112686Z %153 = arith.addi %148, %152 : i32 2026-02-21T09:55:13.9112799Z %154 = arith.divsi %151, %150 : i32 2026-02-21T09:55:13.9112912Z %155 = arith.muli %153, %c128_i32 : i32 2026-02-21T09:55:13.9113103Z %156 = tt.splat %155 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9113322Z %157 = tt.splat %155 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9113540Z %158 = arith.addi %156, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9113753Z %159 = arith.addi %157, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9113918Z %160 = arith.muli %154, %c128_i32 : i32 2026-02-21T09:55:13.9114078Z %161 = tt.splat %160 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9114286Z %162 = arith.addi %161, %6 : 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9114557Z %163 = tt.expand_dims %158 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9114811Z %164 = arith.muli %163, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9115008Z %165 = tt.broadcast %164 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9115183Z %166 = arith.extsi %160 : i32 to i64 2026-02-21T09:55:13.9115351Z %167 = tt.splat %166 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9115572Z %168 = arith.addi %167, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9115850Z %169 = tt.expand_dims %168 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9116154Z %170 = tt.broadcast %169 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9116358Z %171 = arith.cmpi sge, %169, %cst_8 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9116533Z %172 = arith.cmpi slt, %169, %cst_9 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9116698Z %173 = arith.andi %171, %172 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.9116886Z %174 = tt.broadcast %173 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9117104Z %175 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9117387Z %176 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9117656Z %177 = tt.broadcast %176 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9117848Z %178 = arith.addi %165, %177 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9118063Z %179 = tt.addptr %8, %178 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, 
#blocked1> 2026-02-21T09:55:13.9118268Z %180 = tt.load %179 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9118508Z %181 = tt.expand_dims %11 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9118751Z %182 = arith.muli %181, %cst_5 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9118940Z %183 = tt.broadcast %182 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9119138Z %184 = arith.addi %183, %170 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9119334Z %185 = tt.addptr %9, %184 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9119544Z %186 = arith.cmpi sge, %181, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9119722Z %187 = arith.cmpi slt, %181, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9119882Z %188 = arith.andi %186, %187 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9120066Z %189 = tt.broadcast %188 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9120252Z %190 = arith.andi %189, %174 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9120488Z %191 = tt.load %185, %190, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9120834Z %192 = ttg.memdesc_index %175[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9121196Z ttg.local_store %180, %192 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9121472Z %193 = arith.addi %7, %cst_3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9121747Z %194 = tt.expand_dims %193 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9122021Z %195 = tt.broadcast %194 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, 
#blocked1> 2026-02-21T09:55:13.9122214Z %196 = arith.addi %165, %195 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9122411Z %197 = tt.addptr %8, %196 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9122719Z %198 = tt.load %197 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9122909Z %199 = arith.addi %11, %cst_4 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9123183Z %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9123425Z %201 = arith.muli %200, %cst_5 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9123636Z %202 = tt.broadcast %201 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9123826Z %203 = arith.addi %202, %170 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9124020Z %204 = tt.addptr %9, %203 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9124225Z %205 = arith.cmpi sge, %200, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9124394Z %206 = arith.cmpi slt, %200, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9124553Z %207 = arith.andi %205, %206 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9124734Z %208 = tt.broadcast %207 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9124938Z %209 = arith.andi %208, %174 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9125150Z %210 = tt.load %204, %209, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9125491Z %211 = ttg.memdesc_index %175[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9125865Z ttg.local_store %198, %211 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9126503Z %212:6 = scf.for %arg4 = 
%c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %192, %arg8 = %211, %arg9 = %191, %arg10 = %210) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>) : i32 { 2026-02-21T09:55:13.9127038Z %439 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:55:13.9127162Z %440 = arith.muli %439, %c2_i32 : i32 2026-02-21T09:55:13.9127332Z %441 = tt.splat %440 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9127558Z %442 = arith.addi %441, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9127832Z %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9128109Z %444 = tt.broadcast %443 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9128305Z %445 = arith.addi %165, %444 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9128526Z %446 = tt.addptr %8, %445 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9128729Z %447 = tt.load %446 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9129031Z %448 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9129465Z %449 = arith.extf %448 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9129748Z %450 = arith.extsi %439 : i32 to i64 2026-02-21T09:55:13.9129918Z %451 = tt.splat %450 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9130140Z %452 = arith.addi %451, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 
2026-02-21T09:55:13.9130414Z %453 = tt.expand_dims %452 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9130660Z %454 = arith.muli %453, %cst_5 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9130852Z %455 = tt.broadcast %454 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9131045Z %456 = arith.addi %455, %170 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9131243Z %457 = tt.addptr %9, %456 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9131464Z %458 = arith.cmpi sge, %453, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9131633Z %459 = arith.cmpi slt, %453, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9131795Z %460 = arith.andi %458, %459 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9131980Z %461 = tt.broadcast %460 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9132170Z %462 = arith.andi %461, %174 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9132342Z %463 = tt.load %457, %462, %cst_11 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9132538Z %464 = arith.shli %arg9, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9132704Z %465 = arith.shrsi %464, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9132950Z %466 = ttg.convert_layout %465 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9133203Z %467 = arith.shrsi %arg9, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9133468Z %468 = ttg.convert_layout %467 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9133799Z %469 = tt.expand_dims %466 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9134136Z %470 = tt.expand_dims %468 {axis = 1 : i32} : tensor<4x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9134417Z %471 = tt.broadcast %469 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9134661Z %472 = arith.select %17, %471, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9134902Z %473 = tt.broadcast %470 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9135137Z %474 = arith.select %19, %473, %472 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9135367Z %475 = tt.reshape %474 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9135595Z %476 = arith.sitofp %475 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9135846Z %477 = ttg.local_alloc %476 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9136187Z %478 = ttg.local_load %477 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9136668Z %479 = tt.dot %449, %478, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9137019Z %480 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.9137148Z %481 = arith.cmpi slt, %480, %c2_i32 : i32 2026-02-21T09:55:13.9137280Z %482 = arith.select %481, %480, %c0_i32 : i32 2026-02-21T09:55:13.9137553Z %483 = ttg.memdesc_index %175[%482] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9137917Z ttg.local_store %447, %483 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9138403Z scf.yield %479, %482, %arg8, %483, %arg10, %463 : 
tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9138831Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.9139144Z %213 = ttg.local_load %212#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9139593Z %214 = arith.extf %213 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9139893Z %215 = arith.shli %212#4, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9140060Z %216 = arith.shrsi %215, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9140308Z %217 = ttg.convert_layout %216 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9140554Z %218 = arith.shrsi %212#4, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9140816Z %219 = ttg.convert_layout %218 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9141147Z %220 = tt.expand_dims %217 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9141484Z %221 = tt.expand_dims %219 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9141780Z %222 = tt.broadcast %220 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9142020Z %223 = arith.select %17, %222, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9142261Z %224 = tt.broadcast %221 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 
2026-02-21T09:55:13.9142491Z %225 = arith.select %19, %224, %223 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9142722Z %226 = tt.reshape %225 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9142946Z %227 = arith.sitofp %226 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9143198Z %228 = ttg.local_alloc %227 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9143523Z %229 = ttg.local_load %228 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9144015Z %230 = tt.dot %214, %229, %212#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9144536Z %231 = ttg.local_load %212#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9144969Z %232 = arith.extf %231 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9145271Z %233 = arith.shli %212#5, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9145439Z %234 = arith.shrsi %233, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9145684Z %235 = ttg.convert_layout %234 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9145930Z %236 = arith.shrsi %212#5, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9146175Z %237 = ttg.convert_layout %236 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9146505Z %238 = tt.expand_dims %235 {axis = 1 : i32} : tensor<4x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9146841Z %239 = tt.expand_dims %237 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9147123Z %240 = tt.broadcast %238 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9147373Z %241 = arith.select %17, %240, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9147610Z %242 = tt.broadcast %239 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9147843Z %243 = arith.select %19, %242, %241 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9148074Z %244 = tt.reshape %243 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9148298Z %245 = arith.sitofp %244 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9148548Z %246 = ttg.local_alloc %245 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9148889Z %247 = ttg.local_load %246 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9149363Z %248 = tt.dot %232, %247, %230, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9149763Z ttg.local_dealloc %175 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9149979Z %249 = arith.truncf %248 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.9150251Z %250 = tt.expand_dims %159 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9150491Z %251 = arith.muli %250, %cst : tensor<128x1xi32, #mma> 
2026-02-21T09:55:13.9150721Z %252 = tt.expand_dims %162 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.9150983Z %253 = tt.broadcast %251 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9151187Z %254 = tt.broadcast %252 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9151369Z %255 = arith.addi %253, %254 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9151561Z %256 = tt.addptr %20, %255 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9151762Z tt.store %256, %249 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.9151906Z %257 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:55:13.9152028Z %258 = arith.divsi %257, %c256_i32 : i32 2026-02-21T09:55:13.9152147Z %259 = arith.muli %258, %c4_i32 : i32 2026-02-21T09:55:13.9152285Z %260 = arith.subi %c128_i32, %259 : i32 2026-02-21T09:55:13.9152403Z %261 = arith.minsi %260, %c4_i32 : i32 2026-02-21T09:55:13.9152521Z %262 = arith.remsi %257, %c256_i32 : i32 2026-02-21T09:55:13.9152635Z %263 = arith.remsi %262, %261 : i32 2026-02-21T09:55:13.9152749Z %264 = arith.addi %259, %263 : i32 2026-02-21T09:55:13.9152862Z %265 = arith.divsi %262, %261 : i32 2026-02-21T09:55:13.9152977Z %266 = arith.muli %264, %c128_i32 : i32 2026-02-21T09:55:13.9153149Z %267 = tt.splat %266 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9153370Z %268 = tt.splat %266 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9153591Z %269 = arith.addi %267, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9153803Z %270 = arith.addi %268, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9153973Z %271 = arith.muli %265, %c128_i32 : i32 2026-02-21T09:55:13.9154134Z %272 = tt.splat %271 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T09:55:13.9154340Z %273 = arith.addi %272, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9154614Z %274 = tt.expand_dims %269 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9154885Z %275 = arith.muli %274, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9155081Z %276 = tt.broadcast %275 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9155259Z %277 = arith.extsi %271 : i32 to i64 2026-02-21T09:55:13.9155427Z %278 = tt.splat %277 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9155654Z %279 = arith.addi %278, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9155936Z %280 = tt.expand_dims %279 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9156243Z %281 = tt.broadcast %280 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9156448Z %282 = arith.cmpi sge, %280, %cst_8 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9156625Z %283 = arith.cmpi slt, %280, %cst_9 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9156793Z %284 = arith.andi %282, %283 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.9156984Z %285 = tt.broadcast %284 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9157222Z %286 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9157410Z %287 = arith.addi %276, %177 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9157613Z %288 = tt.addptr %8, %287 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9157825Z %289 = tt.load %288 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9157987Z %290 = arith.addi %183, %281 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9158187Z %291 = 
tt.addptr %9, %290 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9158387Z %292 = arith.andi %189, %285 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9158607Z %293 = tt.load %291, %292, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9158959Z %294 = ttg.memdesc_index %286[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9159323Z ttg.local_store %289, %294 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9159568Z %295 = arith.addi %276, %195 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9159784Z %296 = tt.addptr %8, %295 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9159990Z %297 = tt.load %296 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9160155Z %298 = arith.addi %202, %281 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9160348Z %299 = tt.addptr %9, %298 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9160550Z %300 = arith.andi %208, %285 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9160763Z %301 = tt.load %299, %300, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9161111Z %302 = ttg.memdesc_index %286[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9161475Z ttg.local_store %297, %302 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9162108Z %303:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %294, %arg8 = %302, %arg9 = %293, %arg10 = %301) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, 
!ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>) : i32 { 2026-02-21T09:55:13.9162697Z %439 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:55:13.9162825Z %440 = arith.muli %439, %c2_i32 : i32 2026-02-21T09:55:13.9162997Z %441 = tt.splat %440 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9163224Z %442 = arith.addi %441, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9163502Z %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9163785Z %444 = tt.broadcast %443 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9163988Z %445 = arith.addi %276, %444 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9164210Z %446 = tt.addptr %8, %445 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9164418Z %447 = tt.load %446 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9164720Z %448 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9165183Z %449 = arith.extf %448 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9165466Z %450 = arith.extsi %439 : i32 to i64 2026-02-21T09:55:13.9165639Z %451 = tt.splat %450 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9165865Z %452 = arith.addi %451, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9166144Z %453 = tt.expand_dims %452 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9166396Z %454 = arith.muli %453, %cst_5 : tensor<4x1xi64, 
#blocked2> 2026-02-21T09:55:13.9166596Z %455 = tt.broadcast %454 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9166790Z %456 = arith.addi %455, %281 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9166995Z %457 = tt.addptr %9, %456 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9167202Z %458 = arith.cmpi sge, %453, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9167374Z %459 = arith.cmpi slt, %453, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9167535Z %460 = arith.andi %458, %459 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9167746Z %461 = tt.broadcast %460 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9167946Z %462 = arith.andi %461, %285 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9168117Z %463 = tt.load %457, %462, %cst_11 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9168298Z %464 = arith.shli %arg9, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9168466Z %465 = arith.shrsi %464, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9168717Z %466 = ttg.convert_layout %465 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9168973Z %467 = arith.shrsi %arg9, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9169222Z %468 = ttg.convert_layout %467 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9169561Z %469 = tt.expand_dims %466 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9169901Z %470 = tt.expand_dims %468 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9170189Z %471 = tt.broadcast %469 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9170439Z %472 = arith.select %17, %471, 
%cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9170697Z %473 = tt.broadcast %470 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9170938Z %474 = arith.select %19, %473, %472 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9171173Z %475 = tt.reshape %474 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9171402Z %476 = arith.sitofp %475 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9171659Z %477 = ttg.local_alloc %476 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9172004Z %478 = ttg.local_load %477 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9172485Z %479 = tt.dot %449, %478, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9172838Z %480 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.9172999Z %481 = arith.cmpi slt, %480, %c2_i32 : i32 2026-02-21T09:55:13.9173137Z %482 = arith.select %481, %480, %c0_i32 : i32 2026-02-21T09:55:13.9173408Z %483 = ttg.memdesc_index %286[%482] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9173773Z ttg.local_store %447, %483 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9174269Z scf.yield %479, %482, %arg8, %483, %arg10, %463 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9174695Z } {tt.flatten, 
tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.9175014Z %304 = ttg.local_load %303#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9175451Z %305 = arith.extf %304 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9175781Z %306 = arith.shli %303#4, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9175957Z %307 = arith.shrsi %306, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9176206Z %308 = ttg.convert_layout %307 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9176458Z %309 = arith.shrsi %303#4, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9176706Z %310 = ttg.convert_layout %309 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9177041Z %311 = tt.expand_dims %308 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9177382Z %312 = tt.expand_dims %310 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9177669Z %313 = tt.broadcast %311 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9177914Z %314 = arith.select %17, %313, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9178161Z %315 = tt.broadcast %312 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9178401Z %316 = arith.select %19, %315, %314 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9178636Z %317 = tt.reshape %316 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9178879Z 
%318 = arith.sitofp %317 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9179139Z %319 = ttg.local_alloc %318 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9179471Z %320 = ttg.local_load %319 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9179951Z %321 = tt.dot %305, %320, %303#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9180465Z %322 = ttg.local_load %303#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9180899Z %323 = arith.extf %322 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9181202Z %324 = arith.shli %303#5, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9181387Z %325 = arith.shrsi %324, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9181633Z %326 = ttg.convert_layout %325 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9181883Z %327 = arith.shrsi %303#5, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9182131Z %328 = ttg.convert_layout %327 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9182466Z %329 = tt.expand_dims %326 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9182805Z %330 = tt.expand_dims %328 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9183090Z %331 = 
tt.broadcast %329 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9183334Z %332 = arith.select %17, %331, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9183576Z %333 = tt.broadcast %330 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9183833Z %334 = arith.select %19, %333, %332 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9184065Z %335 = tt.reshape %334 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9184292Z %336 = arith.sitofp %335 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9184545Z %337 = ttg.local_alloc %336 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9184872Z %338 = ttg.local_load %337 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9185339Z %339 = tt.dot %323, %338, %321, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9185730Z ttg.local_dealloc %286 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9185952Z %340 = arith.truncf %339 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.9186224Z %341 = tt.expand_dims %270 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9186470Z %342 = arith.muli %341, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9186701Z %343 = tt.expand_dims %273 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.9186981Z %344 = tt.broadcast %342 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9187187Z 
%345 = tt.broadcast %343 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9187368Z %346 = arith.addi %344, %345 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9187563Z %347 = tt.addptr %20, %346 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9187762Z tt.store %347, %340 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.9187909Z %348 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:55:13.9188030Z %349 = arith.divsi %348, %c256_i32 : i32 2026-02-21T09:55:13.9188171Z %350 = arith.muli %349, %c4_i32 : i32 2026-02-21T09:55:13.9188292Z %351 = arith.subi %c128_i32, %350 : i32 2026-02-21T09:55:13.9188412Z %352 = arith.minsi %351, %c4_i32 : i32 2026-02-21T09:55:13.9188529Z %353 = arith.remsi %348, %c256_i32 : i32 2026-02-21T09:55:13.9188646Z %354 = arith.remsi %353, %352 : i32 2026-02-21T09:55:13.9188760Z %355 = arith.addi %350, %354 : i32 2026-02-21T09:55:13.9188873Z %356 = arith.divsi %353, %352 : i32 2026-02-21T09:55:13.9189010Z %357 = arith.muli %355, %c128_i32 : i32 2026-02-21T09:55:13.9189186Z %358 = tt.splat %357 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9189405Z %359 = tt.splat %357 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9189629Z %360 = arith.addi %358, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9189844Z %361 = arith.addi %359, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9190016Z %362 = arith.muli %356, %c128_i32 : i32 2026-02-21T09:55:13.9190178Z %363 = tt.splat %362 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9190388Z %364 = arith.addi %363, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9190664Z %365 = tt.expand_dims %360 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9190917Z %366 = 
arith.muli %365, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9191115Z %367 = tt.broadcast %366 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9191288Z %368 = arith.extsi %362 : i32 to i64 2026-02-21T09:55:13.9191476Z %369 = tt.splat %368 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9191701Z %370 = arith.addi %369, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9191983Z %371 = tt.expand_dims %370 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9192268Z %372 = tt.broadcast %371 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9192473Z %373 = arith.cmpi sge, %371, %cst_8 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9192649Z %374 = arith.cmpi slt, %371, %cst_9 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9192818Z %375 = arith.andi %373, %374 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.9193009Z %376 = tt.broadcast %375 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9193231Z %377 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9193420Z %378 = arith.addi %367, %177 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9193620Z %379 = tt.addptr %8, %378 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9193826Z %380 = tt.load %379 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9193986Z %381 = arith.addi %183, %372 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9194198Z %382 = tt.addptr %9, %381 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9194397Z %383 = arith.andi %189, %376 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9194613Z %384 = tt.load %382, %383, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr, #blocked2> 
2026-02-21T09:55:13.9194957Z %385 = ttg.memdesc_index %377[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9195324Z ttg.local_store %380, %385 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9195584Z %386 = arith.addi %367, %195 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9195780Z %387 = tt.addptr %8, %386 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9195983Z %388 = tt.load %387 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9196143Z %389 = arith.addi %202, %372 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9196337Z %390 = tt.addptr %9, %389 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9196547Z %391 = arith.andi %208, %376 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9196759Z %392 = tt.load %390, %391, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9197100Z %393 = ttg.memdesc_index %377[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9197458Z ttg.local_store %388, %393 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9198093Z %394:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %385, %arg8 = %393, %arg9 = %384, %arg10 = %392) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>) : i32 { 2026-02-21T09:55:13.9198624Z %439 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:55:13.9198751Z %440 = arith.muli %439, %c2_i32 : i32 2026-02-21T09:55:13.9198954Z %441 = tt.splat %440 : i32 -> 
tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9199182Z %442 = arith.addi %441, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9199477Z %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9199757Z %444 = tt.broadcast %443 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9199950Z %445 = arith.addi %367, %444 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9200155Z %446 = tt.addptr %8, %445 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9200361Z %447 = tt.load %446 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9200661Z %448 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9201100Z %449 = arith.extf %448 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9201380Z %450 = arith.extsi %439 : i32 to i64 2026-02-21T09:55:13.9201552Z %451 = tt.splat %450 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9201777Z %452 = arith.addi %451, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9202050Z %453 = tt.expand_dims %452 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9202317Z %454 = arith.muli %453, %cst_5 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9202510Z %455 = tt.broadcast %454 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9202751Z %456 = arith.addi %455, %372 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9202951Z %457 = tt.addptr %9, %456 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, 
#blocked2> 2026-02-21T09:55:13.9203157Z %458 = arith.cmpi sge, %453, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9203327Z %459 = arith.cmpi slt, %453, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9203517Z %460 = arith.andi %458, %459 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9203705Z %461 = tt.broadcast %460 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9203898Z %462 = arith.andi %461, %376 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9204068Z %463 = tt.load %457, %462, %cst_11 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9204248Z %464 = arith.shli %arg9, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9204431Z %465 = arith.shrsi %464, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9204682Z %466 = ttg.convert_layout %465 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9204933Z %467 = arith.shrsi %arg9, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9205180Z %468 = ttg.convert_layout %467 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9205520Z %469 = tt.expand_dims %466 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9205855Z %470 = tt.expand_dims %468 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9206141Z %471 = tt.broadcast %469 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9206386Z %472 = arith.select %17, %471, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9206628Z %473 = tt.broadcast %470 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9206865Z %474 = arith.select %19, %473, %472 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 
2026-02-21T09:55:13.9207115Z %475 = tt.reshape %474 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9207341Z %476 = arith.sitofp %475 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9207594Z %477 = ttg.local_alloc %476 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9207921Z %478 = ttg.local_load %477 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9208401Z %479 = tt.dot %449, %478, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9208748Z %480 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:55:13.9208875Z %481 = arith.cmpi slt, %480, %c2_i32 : i32 2026-02-21T09:55:13.9209013Z %482 = arith.select %481, %480, %c0_i32 : i32 2026-02-21T09:55:13.9209282Z %483 = ttg.memdesc_index %377[%482] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9209643Z ttg.local_store %447, %483 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9210129Z scf.yield %479, %482, %arg8, %483, %arg10, %463 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9210570Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:13.9210886Z %395 = ttg.local_load %394#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9211321Z %396 = arith.extf %395 : tensor<128x8xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9211645Z %397 = arith.shli %394#4, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9211815Z %398 = arith.shrsi %397, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9212059Z %399 = ttg.convert_layout %398 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9212308Z %400 = arith.shrsi %394#4, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9212563Z %401 = ttg.convert_layout %400 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9212893Z %402 = tt.expand_dims %399 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9213230Z %403 = tt.expand_dims %401 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9213514Z %404 = tt.broadcast %402 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9213758Z %405 = arith.select %17, %404, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9213995Z %406 = tt.broadcast %403 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9214228Z %407 = arith.select %19, %406, %405 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9214462Z %408 = tt.reshape %407 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9214684Z %409 = arith.sitofp %408 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9214935Z %410 = ttg.local_alloc %409 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9215279Z %411 = ttg.local_load %410 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> 
tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9215752Z %412 = tt.dot %396, %411, %394#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9216251Z %413 = ttg.local_load %394#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9216685Z %414 = arith.extf %413 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9216994Z %415 = arith.shli %394#5, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9217166Z %416 = arith.shrsi %415, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9217410Z %417 = ttg.convert_layout %416 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9217659Z %418 = arith.shrsi %394#5, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9217901Z %419 = ttg.convert_layout %418 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9218250Z %420 = tt.expand_dims %417 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9218587Z %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9218866Z %422 = tt.broadcast %420 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9219108Z %423 = arith.select %17, %422, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9219348Z %424 = tt.broadcast %421 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, 
#blocked> 2026-02-21T09:55:13.9219597Z %425 = arith.select %19, %424, %423 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9219827Z %426 = tt.reshape %425 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9220049Z %427 = arith.sitofp %426 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9220301Z %428 = ttg.local_alloc %427 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9220638Z %429 = ttg.local_load %428 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9221107Z %430 = tt.dot %414, %429, %412, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9221195Z ttg.local_dealloc %377 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9221287Z %431 = arith.truncf %430 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.9221425Z %432 = tt.expand_dims %361 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9221486Z %433 = arith.muli %432, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9221622Z %434 = tt.expand_dims %364 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.9221705Z %435 = tt.broadcast %433 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9221789Z %436 = tt.broadcast %434 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9221849Z %437 = arith.addi %435, %436 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9221961Z %438 = tt.addptr %20, %437 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9222026Z 
tt.store %438, %431 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.9222075Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:55:13.9222116Z %25 = arith.subi %2, %24 : i32 2026-02-21T09:55:13.9222158Z %26 = arith.muli %25, %c128_i32 : i32 2026-02-21T09:55:13.9222246Z %27 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9222291Z %28 = arith.cmpi sgt, %26, %c0_i32 : i32 2026-02-21T09:55:13.9222332Z %29 = arith.divsi %24, %c256_i32 : i32 2026-02-21T09:55:13.9222373Z %30 = arith.muli %29, %c4_i32 : i32 2026-02-21T09:55:13.9222413Z %31 = arith.subi %c128_i32, %30 : i32 2026-02-21T09:55:13.9222453Z %32 = arith.minsi %31, %c4_i32 : i32 2026-02-21T09:55:13.9222493Z %33 = arith.remsi %24, %c256_i32 : i32 2026-02-21T09:55:13.9222535Z %34 = arith.remsi %33, %32 : i32 2026-02-21T09:55:13.9222574Z %35 = arith.addi %30, %34 : i32 2026-02-21T09:55:13.9222611Z %36 = arith.divsi %33, %32 : i32 2026-02-21T09:55:13.9222655Z %37 = arith.muli %35, %c128_i32 : i32 2026-02-21T09:55:13.9222745Z %38 = tt.splat %37 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9222835Z %39 = arith.addi %38, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9222891Z %40 = arith.muli %36, %c128_i32 : i32 2026-02-21T09:55:13.9223036Z %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9223100Z %42 = arith.muli %41, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9223192Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9223233Z %44 = arith.extsi %40 : i32 to i64 2026-02-21T09:55:13.9223322Z %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9223407Z %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 
2026-02-21T09:55:13.9223571Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9223658Z %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9223727Z %49 = arith.cmpi sge, %47, %cst_8 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9223796Z %50 = arith.cmpi slt, %47, %cst_9 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9223867Z %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.9223953Z %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9224095Z %53 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9224180Z %54 = tt.broadcast %53 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9224239Z %55 = arith.addi %43, %54 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9224340Z %56 = tt.addptr %8, %55 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9224398Z %57 = tt.splat %28 : i1 -> tensor<128x8xi1, #blocked1> 2026-02-21T09:55:13.9224503Z %58 = tt.load %56, %57 {amd.pipeliner_part = "prologue"} : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9224694Z %59 = ttg.memdesc_index %27[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9224833Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9224878Z %60 = arith.cmpi sgt, %26, %c1_i32 : i32 2026-02-21T09:55:13.9224987Z %61 = arith.addi %7, %cst_3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9225123Z %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> 
tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9225208Z %63 = tt.broadcast %62 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9225268Z %64 = arith.addi %43, %63 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9225368Z %65 = tt.addptr %8, %64 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9225424Z %66 = tt.splat %60 : i1 -> tensor<128x8xi1, #blocked1> 2026-02-21T09:55:13.9225527Z %67 = tt.load %65, %66 {amd.pipeliner_part = "prologue"} : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9225709Z %68 = ttg.memdesc_index %27[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9225846Z ttg.local_store %67, %68 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9225887Z %69 = arith.subi %26, %c2_i32 : i32 2026-02-21T09:55:13.9226680Z %70:18 = scf.for %arg3 = %c0_i32 to %69 step %c1_i32 iter_args(%arg4 = %c1_i32, %arg5 = %24, %arg6 = %cst_2, %arg7 = %37, %arg8 = %40, %arg9 = %43, %arg10 = %48, %arg11 = %52, %arg12 = %c1_i32, %arg13 = %c0_i32, %arg14 = %c4_i32, %arg15 = %59, %arg16 = %68, %arg17 = %48, %arg18 = %52, %arg19 = %c0_i32, %arg20 = %37, %arg21 = %40) -> (i32, i32, tensor<128x128xf32, #mma>, i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32) : i32 { 2026-02-21T09:55:13.9226745Z %147 = arith.addi %arg14, %c4_i32 : i32 2026-02-21T09:55:13.9226791Z %148 = arith.addi %arg4, %c1_i32 : i32 2026-02-21T09:55:13.9226841Z %149 = arith.cmpi eq, %arg4, %c127_i32 : i32 2026-02-21T09:55:13.9226905Z %150 = arith.select %149, %c0_i32, %148 : i32 2026-02-21T09:55:13.9226953Z %151 = arith.cmpi eq, 
%150, %c0_i32 : i32 2026-02-21T09:55:13.9226999Z %152 = arith.select %151, %c0_i32, %147 : i32 2026-02-21T09:55:13.9227148Z %153:6 = scf.if %151 -> (i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32) { 2026-02-21T09:55:13.9227194Z %199 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:55:13.9227238Z %200 = arith.divsi %199, %c256_i32 : i32 2026-02-21T09:55:13.9227294Z %201 = arith.muli %200, %c4_i32 : i32 2026-02-21T09:55:13.9227336Z %202 = arith.subi %c128_i32, %201 : i32 2026-02-21T09:55:13.9227380Z %203 = arith.minsi %202, %c4_i32 : i32 2026-02-21T09:55:13.9227424Z %204 = arith.remsi %199, %c256_i32 : i32 2026-02-21T09:55:13.9227466Z %205 = arith.remsi %204, %203 : i32 2026-02-21T09:55:13.9227509Z %206 = arith.addi %201, %205 : i32 2026-02-21T09:55:13.9227549Z %207 = arith.divsi %204, %203 : i32 2026-02-21T09:55:13.9227592Z %208 = arith.muli %206, %c128_i32 : i32 2026-02-21T09:55:13.9227690Z %209 = tt.splat %208 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9227783Z %210 = arith.addi %209, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:13.9227827Z %211 = arith.muli %207, %c128_i32 : i32 2026-02-21T09:55:13.9227978Z %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9228046Z %213 = arith.muli %212, %cst_10 : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:13.9228140Z %214 = tt.broadcast %213 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9228199Z %215 = arith.extsi %211 : i32 to i64 2026-02-21T09:55:13.9228294Z %216 = tt.splat %215 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9228388Z %217 = arith.addi %216, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:13.9228537Z %218 = tt.expand_dims %217 {axis = 0 : i32} : 
tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9228633Z %219 = tt.broadcast %218 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9228705Z %220 = arith.cmpi sge, %218, %cst_8 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9228774Z %221 = arith.cmpi slt, %218, %cst_9 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:13.9228839Z %222 = arith.andi %220, %221 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:13.9228931Z %223 = tt.broadcast %222 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9229097Z scf.yield %208, %211, %214, %219, %223, %199 : i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32 2026-02-21T09:55:13.9229134Z } else { 2026-02-21T09:55:13.9229318Z scf.yield %arg7, %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32 2026-02-21T09:55:13.9229350Z } 2026-02-21T09:55:13.9229406Z %154 = arith.muli %152, %c2_i32 : i32 2026-02-21T09:55:13.9229499Z %155 = tt.splat %154 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9229590Z %156 = arith.addi %155, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:13.9229734Z %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:13.9229825Z %158 = tt.broadcast %157 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9229889Z %159 = arith.addi %153#2, %158 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9230102Z %160 = tt.addptr %8, %159 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:13.9230167Z %161 = tt.load %160 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:13.9230369Z %162 = ttg.local_load %arg15 : !ttg.memdesc<128x8xbf16, 
#shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9230595Z %163 = arith.extf %162 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9230640Z %164 = arith.extsi %arg13 : i32 to i64 2026-02-21T09:55:13.9230731Z %165 = tt.splat %164 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9230821Z %166 = arith.addi %165, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9230965Z %167 = tt.expand_dims %166 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9231031Z %168 = arith.muli %167, %cst_5 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9231120Z %169 = tt.broadcast %168 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9231188Z %170 = arith.addi %169, %arg17 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9231288Z %171 = tt.addptr %9, %170 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9231355Z %172 = arith.cmpi sge, %167, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9231423Z %173 = arith.cmpi slt, %167, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9231481Z %174 = arith.andi %172, %173 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9231584Z %175 = tt.broadcast %174 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9231649Z %176 = arith.andi %175, %arg18 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9231722Z %177 = tt.load %171, %176, %cst_11 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9231866Z %178 = ttg.convert_layout %177 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9231968Z %179 = arith.shli %178, %cst_14 : tensor<4x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9232066Z %180 = arith.shrsi %179, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9232163Z %181 = arith.shrsi %178, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9232314Z %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9232461Z %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9232559Z %184 = tt.broadcast %182 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9232665Z %185 = arith.select %17, %184, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9232756Z %186 = tt.broadcast %183 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9232868Z %187 = arith.select %19, %186, %185 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9232958Z %188 = tt.reshape %187 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9233050Z %189 = arith.sitofp %188 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9233167Z %190 = ttg.local_alloc %189 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9233338Z %191 = ttg.local_load %190 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9233619Z %192 = tt.dot %163, %191, %arg6, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9233670Z %193 = arith.cmpi eq, %arg19, %c127_i32 : i32 
2026-02-21T09:55:13.9233741Z %194 = arith.select %193, %cst_2, %192 : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9233776Z scf.if %193 { 2026-02-21T09:55:13.9233880Z %199 = tt.splat %arg20 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9233965Z %200 = arith.addi %199, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9234056Z %201 = tt.splat %arg21 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9234140Z %202 = arith.addi %201, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9234229Z %203 = arith.truncf %192 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.9234370Z %204 = tt.expand_dims %200 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9234431Z %205 = arith.muli %204, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9234568Z %206 = tt.expand_dims %202 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.9234655Z %207 = tt.broadcast %205 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9234737Z %208 = tt.broadcast %206 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9234796Z %209 = arith.addi %207, %208 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9234910Z %210 = tt.addptr %20, %209 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9234975Z tt.store %210, %203 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.9235007Z } 2026-02-21T09:55:13.9235054Z %195 = arith.addi %arg12, %c1_i32 : i32 2026-02-21T09:55:13.9235099Z %196 = arith.cmpi slt, %195, %c2_i32 : i32 2026-02-21T09:55:13.9235145Z %197 = arith.select %196, %195, %c0_i32 : i32 2026-02-21T09:55:13.9235328Z %198 = ttg.memdesc_index %27[%197] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, 
#shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9235471Z ttg.local_store %161, %198 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:13.9236042Z scf.yield %150, %153#5, %194, %153#0, %153#1, %153#2, %153#3, %153#4, %197, %arg14, %152, %arg16, %198, %arg10, %arg11, %arg4, %arg7, %arg8 : i32, i32, tensor<128x128xf32, #mma>, i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32 2026-02-21T09:55:13.9236086Z } {tt.num_stages = 3 : i32} 2026-02-21T09:55:13.9236132Z %71 = arith.cmpi sge, %26, %c1_i32 : i32 2026-02-21T09:55:13.9236187Z %72 = arith.cmpi sge, %26, %c2_i32 : i32 2026-02-21T09:55:13.9236383Z %73 = ttg.local_load %70#11 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9236576Z %74 = arith.extf %73 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9236618Z %75 = arith.extsi %70#9 : i32 to i64 2026-02-21T09:55:13.9236707Z %76 = tt.splat %75 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9236792Z %77 = arith.addi %76, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9236947Z %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9237009Z %79 = arith.muli %78, %cst_5 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9237097Z %80 = tt.broadcast %79 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9237157Z %81 = arith.addi %80, %70#13 
: tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9237267Z %82 = tt.addptr %9, %81 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9237335Z %83 = arith.cmpi sge, %78, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9237398Z %84 = arith.cmpi slt, %78, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9237454Z %85 = arith.andi %83, %84 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9237542Z %86 = tt.broadcast %85 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9237602Z %87 = arith.andi %86, %70#14 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9237659Z %88 = tt.splat %71 : i1 -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9237716Z %89 = arith.andi %88, %87 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9237785Z %90 = tt.load %82, %89, %cst_11 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9237844Z %91 = arith.shli %90, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9237908Z %92 = arith.shrsi %91, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9238049Z %93 = ttg.convert_layout %92 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9238108Z %94 = arith.shrsi %90, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9238264Z %95 = ttg.convert_layout %94 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9238407Z %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9238549Z %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9238643Z %98 = tt.broadcast %96 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9238742Z %99 = arith.select %17, %98, %cst_12 : tensor<4x2x128xi1, #blocked>, 
tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9238834Z %100 = tt.broadcast %97 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9238933Z %101 = arith.select %19, %100, %99 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9239023Z %102 = tt.reshape %101 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9239113Z %103 = arith.sitofp %102 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9239232Z %104 = ttg.local_alloc %103 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9239402Z %105 = ttg.local_load %104 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9239470Z %106 = scf.if %71 -> (tensor<128x128xf32, #mma>) { 2026-02-21T09:55:13.9239733Z %147 = tt.dot %74, %105, %70#2, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9239780Z scf.yield %147 : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9239813Z } else { 2026-02-21T09:55:13.9239861Z scf.yield %70#2 : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9239894Z } 2026-02-21T09:55:13.9239941Z %107 = arith.cmpi eq, %70#15, %c127_i32 : i32 2026-02-21T09:55:13.9240010Z %108 = arith.select %107, %cst_2, %106 : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9240069Z %109 = arith.andi %71, %107 : i1 2026-02-21T09:55:13.9240102Z scf.if %109 { 2026-02-21T09:55:13.9240189Z %147 = tt.splat %70#16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9240276Z %148 = arith.addi %147, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9240363Z %149 = tt.splat %70#17 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 
2026-02-21T09:55:13.9240458Z %150 = arith.addi %149, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9240544Z %151 = arith.truncf %106 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.9240688Z %152 = tt.expand_dims %148 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9240746Z %153 = arith.muli %152, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9240885Z %154 = tt.expand_dims %150 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.9240969Z %155 = tt.broadcast %153 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9241051Z %156 = tt.broadcast %154 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9241110Z %157 = arith.addi %155, %156 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9241209Z %158 = tt.addptr %20, %157 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9241271Z tt.store %158, %151 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.9241302Z } 2026-02-21T09:55:13.9241372Z %110 = arith.select %71, %108, %70#2 : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9241582Z %111 = ttg.local_load %70#12 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9241777Z %112 = arith.extf %111 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9241824Z %113 = arith.extsi %70#10 : i32 to i64 2026-02-21T09:55:13.9241914Z %114 = tt.splat %113 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9242002Z %115 = arith.addi %114, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:13.9242147Z %116 = tt.expand_dims %115 
{axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9242209Z %117 = arith.muli %116, %cst_5 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9242299Z %118 = tt.broadcast %117 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9242363Z %119 = arith.addi %118, %70#6 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9242463Z %120 = tt.addptr %9, %119 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:13.9242530Z %121 = arith.cmpi sge, %116, %cst_6 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9248194Z %122 = arith.cmpi slt, %116, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:13.9248314Z %123 = arith.andi %121, %122 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:13.9248408Z %124 = tt.broadcast %123 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9248473Z %125 = arith.andi %124, %70#7 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9248532Z %126 = tt.splat %72 : i1 -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9248591Z %127 = arith.andi %126, %125 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:13.9248665Z %128 = tt.load %120, %127, %cst_11 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:13.9248731Z %129 = arith.shli %128, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9248794Z %130 = arith.shrsi %129, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9248968Z %131 = ttg.convert_layout %130 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9249032Z %132 = arith.shrsi %128, %cst_13 : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:13.9249175Z %133 = ttg.convert_layout %132 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:13.9249343Z %134 = tt.expand_dims %131 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9249491Z %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:13.9249588Z %136 = tt.broadcast %134 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9249692Z %137 = arith.select %17, %136, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9249787Z %138 = tt.broadcast %135 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9249885Z %139 = arith.select %19, %138, %137 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:13.9249975Z %140 = tt.reshape %139 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:13.9250067Z %141 = arith.sitofp %140 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:13.9250184Z %142 = ttg.local_alloc %141 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:13.9250354Z %143 = ttg.local_load %142 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:13.9250429Z %144 = scf.if %72 -> (tensor<128x128xf32, #mma>) { 2026-02-21T09:55:13.9250695Z %147 = tt.dot %112, %143, %110, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9250745Z scf.yield %147 : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9250780Z } else { 2026-02-21T09:55:13.9250829Z scf.yield %110 : tensor<128x128xf32, #mma> 2026-02-21T09:55:13.9250860Z } 2026-02-21T09:55:13.9250906Z %145 = arith.cmpi eq, %70#0, %c127_i32 : i32 2026-02-21T09:55:13.9250947Z %146 = arith.andi %72, %145 : i1 2026-02-21T09:55:13.9250982Z scf.if %146 { 2026-02-21T09:55:13.9251071Z %147 = tt.splat 
%70#3 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9251158Z %148 = arith.addi %147, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:13.9251242Z %149 = tt.splat %70#4 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9251324Z %150 = arith.addi %149, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:13.9251415Z %151 = arith.truncf %144 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:13.9251556Z %152 = tt.expand_dims %148 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9251629Z %153 = arith.muli %152, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:13.9251768Z %154 = tt.expand_dims %150 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:13.9251852Z %155 = tt.broadcast %153 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9251933Z %156 = tt.broadcast %154 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9251993Z %157 = arith.addi %155, %156 : tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9252091Z %158 = tt.addptr %20, %157 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:13.9252170Z tt.store %158, %151 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:13.9252203Z } 2026-02-21T09:55:13.9252288Z ttg.local_dealloc %27 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:13.9252320Z tt.return 2026-02-21T09:55:13.9252351Z } 2026-02-21T09:55:13.9252386Z } 2026-02-21T09:55:13.9252390Z 2026-02-21T09:55:13.9252420Z {-# 2026-02-21T09:55:13.9252460Z external_resources: { 2026-02-21T09:55:13.9252497Z mlir_reproducer: { 2026-02-21T09:55:13.9253456Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, 
allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:55:13.9253499Z disable_threading: false, 2026-02-21T09:55:13.9253535Z verify_each: true 2026-02-21T09:55:13.9253566Z } 2026-02-21T09:55:13.9253596Z } 2026-02-21T09:55:13.9253627Z #-} 2026-02-21T09:55:13.9253867Z /tmp/torchinductor_root/3p/c3pbjzzhxqydmil5toh4ejnohqlb63egl3ekytu6xyigao735epj.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:55:13.9254276Z /tmp/torchinductor_root/3p/c3pbjzzhxqydmil5toh4ejnohqlb63egl3ekytu6xyigao735epj.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:55:13.9254403Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:55:13.9255027Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T09:55:13.9255084Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:55:13.9255166Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:55:14.1819626Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:55:14.1824127Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:55:14.1824812Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T09:55:14.1825230Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T09:55:14.1825882Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:55:14.1826163Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:55:14.1826432Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:55:14.1826552Z #smem = #ttg.shared_memory 2026-02-21T09:55:14.1827012Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:55:14.1827759Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:55:14.1828036Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:55:14.1828234Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.1828417Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.1828627Z %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1> 2026-02-21T09:55:14.1828747Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:14.1829007Z %cst_3 = arith.constant 
dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:55:14.1829117Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:55:14.1829225Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:55:14.1829341Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:55:14.1829636Z %cst_4 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.1829934Z %cst_5 = arith.constant dense<504> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1830216Z %cst_6 = arith.constant dense<508> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1830407Z %cst_7 = arith.constant dense<8192> : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1830597Z %cst_8 = arith.constant dense<0> : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1830785Z %cst_9 = arith.constant dense<512> : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1830974Z %cst_10 = arith.constant dense<0> : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:14.1831170Z %cst_11 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:14.1831354Z %cst_12 = arith.constant dense<0> : tensor<4x128xi8, #blocked2> 2026-02-21T09:55:14.1831464Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:55:14.1831635Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:55:14.1831821Z %cst_13 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1831933Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:55:14.1832021Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:55:14.1832186Z %cst_14 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1832238Z %0 = tt.get_program_id x : i32 2026-02-21T09:55:14.1832285Z %1 = arith.divsi %0, %c256_i32 : i32 2026-02-21T09:55:14.1832334Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:55:14.1832380Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:55:14.1832430Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:55:14.1832477Z %5 = 
arith.remsi %0, %c256_i32 : i32 2026-02-21T09:55:14.1832535Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:55:14.1832582Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:55:14.1832623Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:55:14.1832666Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:55:14.1832818Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:14.1832956Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:14.1833084Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:14.1833239Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:14.1833353Z %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:14.1833446Z %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:14.1833551Z %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:14.1833639Z %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:14.1833687Z %18 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:55:14.1833798Z %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:14.1833888Z %20 = arith.addi %19, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:14.1834029Z %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.1834200Z %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1> 2026-02-21T09:55:14.1834300Z %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked1> 
2026-02-21T09:55:14.1834408Z %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1834505Z %25 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:14.1834554Z %26 = arith.extsi %18 : i32 to i64 2026-02-21T09:55:14.1834646Z %27 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:14.1834785Z %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1834967Z %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1835071Z %30 = tt.splat %26 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:14.1835268Z %31 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:14.1835370Z %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:14.1835568Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2> 2026-02-21T09:55:14.1835674Z %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1835759Z %35 = arith.cmpi sge, %33, %cst_10 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:14.1835834Z %36 = arith.cmpi slt, %33, %cst_11 : tensor<1x128xi64, #blocked2> 2026-02-21T09:55:14.1835904Z %37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked2> 2026-02-21T09:55:14.1836005Z %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:14.1836188Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:55:14.1836443Z %40 = tt.expand_dims %39 {axis = 0 : i32} : 
tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:55:14.1836608Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.1836692Z %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.1836794Z %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:14.1836864Z %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.1836985Z %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:14.1837082Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:14.1837246Z %47 = tt.expand_dims %21 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:14.1837346Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1837412Z %49 = arith.addi %24, %48 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1837528Z %50 = tt.addptr %25, %49 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1837618Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:14.1837834Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:14.1837997Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:14.1838110Z %53 = arith.addi %21, %cst_4 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.1838286Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 
2026-02-21T09:55:14.1838386Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1838451Z %56 = arith.addi %24, %55 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1838572Z %57 = tt.addptr %25, %56 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1838638Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked1> 2026-02-21T09:55:14.1838846Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:14.1839006Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:14.1839415Z %60:4 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg4 = %cst_3, %arg5 = %c1_i32, %arg6 = %52, %arg7 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:55:14.1839468Z %128 = arith.addi %arg3, %c8_i32 : i32 2026-02-21T09:55:14.1839541Z %129 = arith.muli %128, %c2_i32 : i32 2026-02-21T09:55:14.1839652Z %130 = tt.splat %129 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.1839758Z %131 = arith.addi %130, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.1839926Z %132 = tt.expand_dims %131 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1> 2026-02-21T09:55:14.1840034Z %133 = tt.broadcast %132 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1840110Z %134 = arith.addi %24, %133 : tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1840233Z %135 = tt.addptr %25, %134 : tensor<128x8x!tt.ptr, #blocked1>, tensor<128x8xi32, #blocked1> 2026-02-21T09:55:14.1840308Z %136 = tt.load %135 : tensor<128x8x!tt.ptr, #blocked1> 
2026-02-21T09:55:14.1840541Z %137 = ttg.local_load %arg6 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1840777Z %138 = arith.extf %137 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1840829Z %139 = arith.extsi %arg3 : i32 to i64 2026-02-21T09:55:14.1840934Z %140 = tt.splat %139 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1841053Z %141 = arith.addi %140, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1841218Z %142 = tt.expand_dims %141 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1841297Z %143 = arith.muli %142, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1841404Z %144 = tt.broadcast %143 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1841472Z %145 = arith.addi %144, %34 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1841595Z %146 = tt.addptr %27, %145 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1841694Z %147 = arith.cmpi sge, %142, %cst_8 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1841769Z %148 = arith.cmpi slt, %142, %cst_9 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1841835Z %149 = arith.andi %147, %148 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:14.1841924Z %150 = tt.broadcast %149 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:14.1841984Z %151 = arith.andi %150, %38 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:14.1842086Z %152 = tt.load %146, %151, %cst_12 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:14.1842244Z %153 = ttg.convert_layout %152 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:55:14.1842342Z %154 = arith.shli %153, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1842440Z %155 = arith.shrsi %154, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1842541Z %156 = arith.shrsi %153, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1842753Z %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.1842900Z %158 = tt.expand_dims %156 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.1842997Z %159 = tt.broadcast %157 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1843102Z %160 = arith.select %43, %159, %cst_13 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1843215Z %161 = tt.broadcast %158 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1843316Z %162 = arith.select %45, %161, %160 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1843404Z %163 = tt.reshape %162 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked2> 2026-02-21T09:55:14.1843495Z %164 = arith.sitofp %163 : tensor<8x128xi8, #blocked2> to tensor<8x128xf32, #blocked2> 2026-02-21T09:55:14.1843614Z %165 = ttg.local_alloc %164 : (tensor<8x128xf32, #blocked2>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:14.1843784Z %166 = ttg.local_load %165 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1844054Z %167 = tt.dot %138, %166, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 
2026-02-21T09:55:14.1844101Z %168 = arith.addi %arg5, %c1_i32 : i32 2026-02-21T09:55:14.1844148Z %169 = arith.cmpi slt, %168, %c2_i32 : i32 2026-02-21T09:55:14.1844196Z %170 = arith.select %169, %168, %c0_i32 : i32 2026-02-21T09:55:14.1844375Z %171 = ttg.memdesc_index %46[%170] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:14.1844514Z ttg.local_store %136, %171 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:14.1844749Z scf.yield %167, %170, %arg7, %171 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:55:14.1844831Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:55:14.1845022Z %61 = ttg.local_load %60#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1845213Z %62 = arith.extf %61 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1845325Z %63 = arith.addi %29, %cst_5 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1845465Z %64 = tt.expand_dims %63 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1845526Z %65 = arith.muli %64, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1845634Z %66 = tt.broadcast %65 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1845692Z %67 = arith.addi %66, %34 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1845788Z %68 = tt.addptr %27, %67 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1845854Z %69 = arith.cmpi sge, %64, %cst_8 : 
tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1845917Z %70 = arith.cmpi slt, %64, %cst_9 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1845974Z %71 = arith.andi %69, %70 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:14.1846059Z %72 = tt.broadcast %71 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:14.1846115Z %73 = arith.andi %72, %38 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:14.1846181Z %74 = tt.load %68, %73, %cst_12 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:14.1846321Z %75 = ttg.convert_layout %74 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1846417Z %76 = arith.shli %75, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1846512Z %77 = arith.shrsi %76, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1846604Z %78 = arith.shrsi %75, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1846769Z %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.1846910Z %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.1847001Z %81 = tt.broadcast %79 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1847102Z %82 = arith.select %43, %81, %cst_13 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1847189Z %83 = tt.broadcast %80 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1847283Z %84 = arith.select %45, %83, %82 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1847368Z %85 = tt.reshape %84 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked2> 2026-02-21T09:55:14.1847455Z %86 = arith.sitofp %85 : tensor<8x128xi8, 
#blocked2> to tensor<8x128xf32, #blocked2> 2026-02-21T09:55:14.1847567Z %87 = ttg.local_alloc %86 : (tensor<8x128xf32, #blocked2>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:14.1847732Z %88 = ttg.local_load %87 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1847988Z %89 = tt.dot %62, %88, %60#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:14.1848197Z %90 = ttg.local_load %60#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1848388Z %91 = arith.extf %90 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1848480Z %92 = arith.addi %29, %cst_6 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.1848636Z %93 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1848695Z %94 = arith.muli %93, %cst_7 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1848781Z %95 = tt.broadcast %94 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1848837Z %96 = arith.addi %95, %34 : tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1848933Z %97 = tt.addptr %27, %96 : tensor<4x128x!tt.ptr, #blocked2>, tensor<4x128xi64, #blocked2> 2026-02-21T09:55:14.1849010Z %98 = arith.cmpi sge, %93, %cst_8 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1849072Z %99 = arith.cmpi slt, %93, %cst_9 : tensor<4x1xi64, #blocked2> 2026-02-21T09:55:14.1849130Z %100 = arith.andi %98, %99 : tensor<4x1xi1, #blocked2> 2026-02-21T09:55:14.1849217Z %101 = tt.broadcast %100 : 
tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2> 2026-02-21T09:55:14.1849274Z %102 = arith.andi %101, %38 : tensor<4x128xi1, #blocked2> 2026-02-21T09:55:14.1849346Z %103 = tt.load %97, %102, %cst_12 : tensor<4x128x!tt.ptr, #blocked2> 2026-02-21T09:55:14.1849488Z %104 = ttg.convert_layout %103 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1849587Z %105 = arith.shli %104, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1849685Z %106 = arith.shrsi %105, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1849779Z %107 = arith.shrsi %104, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.1849925Z %108 = tt.expand_dims %106 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.1850088Z %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.1850182Z %110 = tt.broadcast %108 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1850284Z %111 = arith.select %43, %110, %cst_13 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1850377Z %112 = tt.broadcast %109 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1850475Z %113 = arith.select %45, %112, %111 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.1850562Z %114 = tt.reshape %113 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked2> 2026-02-21T09:55:14.1850654Z %115 = arith.sitofp %114 : tensor<8x128xi8, #blocked2> to tensor<8x128xf32, #blocked2> 2026-02-21T09:55:14.1850771Z %116 = ttg.local_alloc %115 : (tensor<8x128xf32, #blocked2>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:14.1850938Z %117 = ttg.local_load 
%116 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.1851198Z %118 = tt.dot %91, %117, %89, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:14.1851294Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:55:14.1851380Z %119 = arith.truncf %118 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:14.1851519Z %120 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:14.1851575Z %121 = arith.muli %120, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:55:14.1851708Z %122 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:14.1851793Z %123 = tt.broadcast %121 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:14.1851888Z %124 = tt.broadcast %122 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:14.1851944Z %125 = arith.addi %123, %124 : tensor<128x128xi32, #mma> 2026-02-21T09:55:14.1852023Z %126 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:14.1852122Z %127 = tt.addptr %126, %125 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:14.1852185Z tt.store %127, %119 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:14.1852219Z tt.return 2026-02-21T09:55:14.1852268Z } 2026-02-21T09:55:14.1852299Z } 2026-02-21T09:55:14.1852304Z 2026-02-21T09:55:14.1852333Z {-# 2026-02-21T09:55:14.1852375Z external_resources: { 2026-02-21T09:55:14.1852411Z mlir_reproducer: { 2026-02-21T09:55:14.1853346Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, 
convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:55:14.1853392Z disable_threading: false, 2026-02-21T09:55:14.1853427Z verify_each: true 2026-02-21T09:55:14.1853458Z } 2026-02-21T09:55:14.1853489Z } 2026-02-21T09:55:14.1853518Z #-} 2026-02-21T09:55:14.1853754Z /tmp/torchinductor_root/mm/cmmpqft2zdf5a5jb2psuhnxqw36eba67uxhlyoxpyotpi5naouph.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:55:14.1854180Z /tmp/torchinductor_root/mm/cmmpqft2zdf5a5jb2psuhnxqw36eba67uxhlyoxpyotpi5naouph.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:55:14.1854293Z [645s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:55:14.1854861Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:55:14.1854918Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:55:14.1854998Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:55:14.8219658Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:55:14.8222479Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:55:14.8222950Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:55:14.8223257Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:55:14.8223550Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:55:14.8223822Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:55:14.8224072Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:55:14.8224301Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:55:14.8224530Z #smem = #ttg.shared_memory 2026-02-21T09:55:14.8224758Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:55:14.8225225Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:55:14.8225629Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:55:14.8225807Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.8225979Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.8226159Z %cst_2 = arith.constant dense<0.000000e+00> : 
tensor<128x128xf32, #mma> 2026-02-21T09:55:14.8226344Z %cst_3 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:14.8226525Z %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:14.8226674Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:55:14.8226792Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:14.8226933Z %cst_5 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.8227080Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:55:14.8227194Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:55:14.8227305Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:55:14.8227493Z %cst_6 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.8227681Z %0 = tt.get_program_id x : i32 2026-02-21T09:55:14.8227792Z %1 = arith.divsi %0, %c128_i32 : i32 2026-02-21T09:55:14.8227907Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:55:14.8228050Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T09:55:14.8228164Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:55:14.8228273Z %5 = arith.remsi %0, %c128_i32 : i32 2026-02-21T09:55:14.8228391Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:55:14.8228498Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:55:14.8228603Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:55:14.8228707Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:55:14.8228910Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.8229189Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:14.8229458Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.8229729Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:14.8229975Z %14 = tt.splat %9 : i32 
-> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.8230190Z %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:14.8230399Z %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:55:14.8230606Z %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:55:14.8230786Z %18 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:55:14.8230949Z %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.8231156Z %20 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:14.8231360Z %21 = arith.addi %19, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:55:14.8231563Z %22 = arith.addi %20, %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:55:14.8231799Z %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:14.8232085Z %24 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:14.8232388Z %25 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:55:14.8232640Z %26 = arith.muli %25, %cst_4 : tensor<128x1xi32, #blocked2> 2026-02-21T09:55:14.8232833Z %27 = tt.broadcast %26 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:14.8233063Z %28 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:14.8233325Z %29 = tt.expand_dims %21 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:55:14.8233601Z %30 = tt.broadcast %29 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:14.8233809Z %31 = tt.splat %arg1 : 
!tt.ptr -> tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:14.8234080Z %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:55:14.8234487Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:55:14.8234878Z %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.8235127Z %35 = arith.cmpi eq, %34, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.8235321Z %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:14.8235511Z %37 = arith.cmpi eq, %34, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:55:14.8235716Z %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:55:14.8235974Z %39 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_2) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:55:14.8236239Z %49 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:14.8236459Z %50 = arith.addi %49, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:55:14.8236630Z %51 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:55:14.8236792Z %52 = tt.splat %51 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:14.8237000Z %53 = arith.addi %52, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:55:14.8237265Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:55:14.8237533Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:55:14.8237719Z %56 = 
arith.addi %27, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:55:14.8237917Z %57 = tt.addptr %28, %56 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:55:14.8238115Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:55:14.8238332Z %59 = ttg.local_alloc %58 : (tensor<128x8xbf16, #blocked2>) -> !ttg.memdesc<128x8xbf16, #shared, #smem> 2026-02-21T09:55:14.8238681Z %60 = ttg.local_load %59 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.8239082Z %61 = arith.extf %60 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.8239463Z %62 = tt.expand_dims %50 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:55:14.8239701Z %63 = arith.muli %62, %cst_3 : tensor<4x1xi32, #blocked1> 2026-02-21T09:55:14.8239907Z %64 = tt.broadcast %63 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:55:14.8240093Z %65 = arith.addi %64, %30 : tensor<4x128xi32, #blocked1> 2026-02-21T09:55:14.8240285Z %66 = tt.addptr %31, %65 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:55:14.8240479Z %67 = tt.load %66 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:55:14.8240750Z %68 = ttg.convert_layout %67 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.8241025Z %69 = arith.shli %68, %cst_6 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.8241255Z %70 = arith.shrsi %69, %cst_6 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.8241480Z %71 = arith.shrsi %68, %cst_6 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:55:14.8241761Z %72 = tt.expand_dims %70 {axis = 1 : i32} : 
tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.8242092Z %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:55:14.8242370Z %74 = tt.broadcast %72 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.8242676Z %75 = arith.select %36, %74, %cst_5 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.8242906Z %76 = tt.broadcast %73 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.8243135Z %77 = arith.select %38, %76, %75 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:55:14.8243380Z %78 = tt.reshape %77 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:55:14.8243595Z %79 = arith.sitofp %78 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:55:14.8243846Z %80 = ttg.local_alloc %79 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:55:14.8244165Z %81 = ttg.local_load %80 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:55:14.8244638Z %82 = tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:55:14.8244988Z scf.yield %82 : tensor<128x128xf32, #mma> 2026-02-21T09:55:14.8245134Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:55:14.8245322Z %40 = arith.truncf %39 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:55:14.8245586Z %41 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:55:14.8245820Z %42 = arith.muli %41, %cst : 
tensor<128x1xi32, #mma> 2026-02-21T09:55:14.8246044Z %43 = tt.expand_dims %22 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:55:14.8246294Z %44 = tt.broadcast %42 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:14.8246510Z %45 = tt.broadcast %43 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:55:14.8246684Z %46 = arith.addi %44, %45 : tensor<128x128xi32, #mma> 2026-02-21T09:55:14.8246856Z %47 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:14.8247068Z %48 = tt.addptr %47, %46 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:55:14.8247259Z tt.store %48, %40 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:55:14.8247390Z tt.return 2026-02-21T09:55:14.8247469Z } 2026-02-21T09:55:14.8247540Z } 2026-02-21T09:55:14.8247583Z 2026-02-21T09:55:14.8247634Z {-# 2026-02-21T09:55:14.8247714Z external_resources: { 2026-02-21T09:55:14.8247810Z mlir_reproducer: { 2026-02-21T09:55:14.8248817Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:55:14.8249799Z disable_threading: false, 2026-02-21T09:55:14.8249904Z verify_each: true 2026-02-21T09:55:14.8249993Z } 2026-02-21T09:55:14.8250064Z } 2026-02-21T09:55:14.8250132Z #-} 2026-02-21T09:55:14.8250407Z /tmp/torchinductor_root/4s/c4s22dpmjftsnnhkpn6wss5xkukvnvtbi4gqvbg7rc7f6mabm3hf.py:13:0: error: Failures 
have been detected while processing an MLIR pass pipeline 2026-02-21T09:55:14.8251088Z /tmp/torchinductor_root/4s/c4s22dpmjftsnnhkpn6wss5xkukvnvtbi4gqvbg7rc7f6mabm3hf.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:55:14.8251642Z [645s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:55:14.8252386Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:55:14.8253037Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:55:14.8253203Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:55:15.4173893Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 61/61 12.6 configs/s 2026-02-21T09:55:19.3748194Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 214/214 36.2 configs/s 2026-02-21T09:55:22.7464716Z [653s] Generation 10 complete: 2026-02-21T09:55:22.7464962Z error=10 2026-02-21T09:55:22.7465059Z ok=53 2026-02-21T09:55:22.7465190Z min=0.9404 2026-02-21T09:55:22.7465276Z mid=1.4120 2026-02-21T09:55:22.7465369Z max=29.3351 2026-02-21T09:55:22.7465476Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:55:22.7465654Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:55:22.7465814Z 'l2_groupings': [2], 2026-02-21T09:55:22.7465943Z 'load_eviction_policies': ['', ''], 2026-02-21T09:55:22.7466083Z 'loop_orders': [[0, 1]], 2026-02-21T09:55:22.7466226Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:55:22.7466344Z 'num_stages': 1, 2026-02-21T09:55:22.7466446Z 'num_warps': 4, 2026-02-21T09:55:22.7466551Z 'pid_type': 'flat', 2026-02-21T09:55:22.7466677Z 'range_flattens': [None, None], 2026-02-21T09:55:22.7466809Z 'range_multi_buffers': [None, False], 2026-02-21T09:55:22.7467204Z 'range_num_stages': [0, 1], 2026-02-21T09:55:22.7467334Z 'range_unroll_factors': [0, 0], 2026-02-21T09:55:22.7467462Z 'range_warp_specializes': [], 2026-02-21T09:55:22.7467588Z 'waves_per_eu': 2} 2026-02-21T09:55:22.7537408Z [653s] Fitting surrogate: 973 points, 973 targets 2026-02-21T09:55:23.3132565Z [654s] Generation 11 starting: 42 neighbors, 2 active search path(s) 2026-02-21T09:55:44.1868171Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.3 configs/s 2026-02-21T09:55:48.6099098Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 9.7 configs/s 2026-02-21T09:55:51.4537167Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 214/214 36.5 configs/s 2026-02-21T09:55:53.8901585Z [684s] Generation 11 complete: 2026-02-21T09:55:53.8901978Z error=2 2026-02-21T09:55:53.8902186Z ok=43 
2026-02-21T09:55:53.8902392Z min=0.9382 2026-02-21T09:55:53.8902596Z mid=1.4550 2026-02-21T09:55:53.8902793Z max=32.6063 2026-02-21T09:55:53.8903021Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:55:53.8903434Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:55:53.8903790Z 'l2_groupings': [2], 2026-02-21T09:55:53.8904070Z 'load_eviction_policies': ['', ''], 2026-02-21T09:55:53.8904782Z 'loop_orders': [[0, 1]], 2026-02-21T09:55:53.8905060Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:55:53.8905327Z 'num_stages': 1, 2026-02-21T09:55:53.8905552Z 'num_warps': 4, 2026-02-21T09:55:53.8905787Z 'pid_type': 'flat', 2026-02-21T09:55:53.8906043Z 'range_flattens': [None, None], 2026-02-21T09:55:53.8906358Z 'range_multi_buffers': [None, False], 2026-02-21T09:55:53.8906664Z 'range_num_stages': [0, 1], 2026-02-21T09:55:53.8906942Z 'range_unroll_factors': [0, 0], 2026-02-21T09:55:53.8907246Z 'range_warp_specializes': [], 2026-02-21T09:55:53.8907520Z 'waves_per_eu': 2} 2026-02-21T09:55:53.8964464Z [684s] Fitting surrogate: 1018 points, 1018 targets 2026-02-21T09:55:54.4155352Z [685s] Generation 12 starting: 42 neighbors, 2 active search path(s) 2026-02-21T09:56:07.7178842Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.9 configs/s 2026-02-21T09:56:10.1147525Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:56:10.1157283Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:56:10.1158535Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:56:10.1159217Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:56:10.1159896Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:56:10.1160525Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:56:10.1161107Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:56:10.1161633Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:56:10.1162043Z #smem = #ttg.shared_memory 2026-02-21T09:56:10.1162642Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:56:10.1163701Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:56:10.1164693Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1165085Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.1165477Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.1165936Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1166229Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:56:10.1166438Z %c504_i32 = arith.constant 504 : i32 2026-02-21T09:56:10.1166780Z %cst_3 = arith.constant dense<504> : tensor<4xi32, 
#ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1167232Z %cst_4 = arith.constant dense<508> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1167669Z %cst_5 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1168053Z %cst_6 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1168378Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1168645Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:56:10.1168848Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:56:10.1169053Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:56:10.1169265Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:56:10.1169467Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:56:10.1169728Z %cst_8 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1170068Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:56:10.1170270Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:56:10.1170460Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:56:10.1170782Z %cst_9 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1171146Z %0 = tt.get_program_id x : i32 2026-02-21T09:56:10.1171350Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:56:10.1171562Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:56:10.1171919Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1172419Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1172897Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1173387Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1173862Z %7 = 
tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1174337Z %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1174774Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1175141Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1175640Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:56:10.1176276Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:56:10.1176812Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.1177154Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.1177415Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:56:10.1177676Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.1177959Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:56:10.1178232Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.1178450Z %19 = arith.subi %2, %0 : i32 2026-02-21T09:56:10.1178598Z %20 = arith.remsi %19, %c3_i32 : i32 2026-02-21T09:56:10.1178780Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:56:10.1178935Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:56:10.1179097Z scf.for %arg3 = %0 to %22 step %c3_i32 : i32 { 2026-02-21T09:56:10.1179296Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.1179459Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:10.1179622Z %25 = arith.subi 
%c128_i32, %24 : i32 2026-02-21T09:56:10.1179779Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:10.1179947Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.1180107Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:10.1180259Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:10.1180417Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:10.1180566Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:10.1180792Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1181082Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1181368Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1181674Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1181896Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:10.1182117Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1182422Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1182706Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1182982Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1183345Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1183690Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1183948Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1184323Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:10.1184685Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> 
tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1184982Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1185344Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.1185708Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1185929Z %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1186198Z %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1186418Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1186728Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1187118Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1187416Z %53 = arith.addi %8, %cst_5 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1187722Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.1188013Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1188218Z %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1188440Z %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1188659Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1188954Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1189334Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, 
#shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1189892Z %60:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:56:10.1190399Z %281 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1190648Z %282 = arith.addi %281, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1190840Z %283 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:56:10.1190987Z %284 = arith.muli %283, %c2_i32 : i32 2026-02-21T09:56:10.1191173Z %285 = tt.splat %284 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1191407Z %286 = arith.addi %285, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1191722Z %287 = tt.expand_dims %286 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.1192025Z %288 = tt.broadcast %287 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1192238Z %289 = arith.addi %43, %288 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1192460Z %290 = tt.addptr %9, %289 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1192687Z %291 = tt.load %290 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1193021Z %292 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1193494Z %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1193903Z %294 = tt.expand_dims %282 {axis = 1 : i32} : 
tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1194173Z %295 = arith.muli %294, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1194380Z %296 = tt.broadcast %295 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1194588Z %297 = arith.addi %296, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1194811Z %298 = tt.addptr %10, %297 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1195026Z %299 = tt.load %298 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1195293Z %300 = ttg.convert_layout %299 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1195593Z %301 = arith.shli %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1195851Z %302 = arith.shrsi %301, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1196107Z %303 = arith.shrsi %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1196432Z %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1196783Z %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1197098Z %306 = tt.broadcast %304 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1197349Z %307 = arith.select %15, %306, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1197596Z %308 = tt.broadcast %305 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1197835Z %309 = arith.select %17, %308, %307 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1198074Z %310 = tt.reshape %309 : tensor<4x2x128xi8, 
#blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1198303Z %311 = arith.sitofp %310 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1198567Z %312 = ttg.local_alloc %311 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1198902Z %313 = ttg.local_load %312 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1199401Z %314 = tt.dot %293, %313, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1199761Z %315 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:10.1199914Z %316 = arith.cmpi slt, %315, %c2_i32 : i32 2026-02-21T09:56:10.1200052Z %317 = arith.select %316, %315, %c0_i32 : i32 2026-02-21T09:56:10.1200332Z %318 = ttg.memdesc_index %46[%317] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1200694Z ttg.local_store %291, %318 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1201102Z scf.yield %314, %317, %arg8, %318 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1201452Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:10.1201667Z %61 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1202018Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1202445Z %63 = arith.extf %62 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1202863Z %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1203107Z %65 = arith.muli %64, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1203293Z %66 = tt.broadcast %65 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1203485Z %67 = arith.addi %66, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1203680Z %68 = tt.addptr %10, %67 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1203874Z %69 = tt.load %68 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1204115Z %70 = ttg.convert_layout %69 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1204405Z %71 = arith.shli %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1204634Z %72 = arith.shrsi %71, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1204861Z %73 = arith.shrsi %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1205157Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1205488Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1205765Z %76 = tt.broadcast %74 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1206002Z %77 = arith.select %15, %76, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1206232Z %78 = tt.broadcast %75 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1206458Z %79 = arith.select %17, %78, %77 : 
tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1206681Z %80 = tt.reshape %79 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1206899Z %81 = arith.sitofp %80 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1207147Z %82 = ttg.local_alloc %81 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1207485Z %83 = ttg.local_load %82 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1207961Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1208348Z %85 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1208673Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1209095Z %87 = arith.extf %86 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1209472Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1209710Z %89 = arith.muli %88, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1209897Z %90 = tt.broadcast %89 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1210086Z %91 = arith.addi %90, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1210275Z %92 = tt.addptr %10, %91 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1210470Z %93 = tt.load %92 : tensor<4x128x!tt.ptr, #blocked1> 
2026-02-21T09:56:10.1210704Z %94 = ttg.convert_layout %93 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1210979Z %95 = arith.shli %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1211208Z %96 = arith.shrsi %95, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1211435Z %97 = arith.shrsi %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1211718Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1212047Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1212344Z %100 = tt.broadcast %98 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1212585Z %101 = arith.select %15, %100, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1212818Z %102 = tt.broadcast %99 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1213064Z %103 = arith.select %17, %102, %101 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1213296Z %104 = tt.reshape %103 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1213522Z %105 = arith.sitofp %104 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1213783Z %106 = ttg.local_alloc %105 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1214111Z %107 = ttg.local_load %106 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1214580Z %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, 
kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1214972Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1215190Z %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:10.1215478Z %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1215714Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1215959Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:10.1216222Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1216430Z %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1216613Z %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1216805Z %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1217009Z tt.store %116, %109 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.1217151Z %117 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:56:10.1217277Z %118 = arith.divsi %117, %c256_i32 : i32 2026-02-21T09:56:10.1217401Z %119 = arith.muli %118, %c4_i32 : i32 2026-02-21T09:56:10.1217519Z %120 = arith.subi %c128_i32, %119 : i32 2026-02-21T09:56:10.1217639Z %121 = arith.minsi %120, %c4_i32 : i32 2026-02-21T09:56:10.1217757Z %122 = arith.remsi %117, %c256_i32 : i32 2026-02-21T09:56:10.1217875Z %123 = arith.remsi %122, %121 : i32 2026-02-21T09:56:10.1217989Z %124 = arith.addi %119, %123 : i32 2026-02-21T09:56:10.1218104Z %125 = arith.divsi %122, %121 : i32 2026-02-21T09:56:10.1218220Z %126 = arith.muli %124, %c128_i32 : i32 2026-02-21T09:56:10.1218395Z %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, 
parent = #blocked2}>> 2026-02-21T09:56:10.1218616Z %128 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1218835Z %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1219052Z %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1219217Z %131 = arith.muli %125, %c128_i32 : i32 2026-02-21T09:56:10.1219388Z %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1219606Z %133 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1219837Z %134 = arith.addi %132, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1220051Z %135 = arith.addi %133, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1220322Z %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1220596Z %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1220794Z %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1221075Z %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:10.1221359Z %140 = tt.broadcast %139 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1221581Z %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1221770Z %142 = arith.addi %138, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1221971Z %143 = tt.addptr %9, %142 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1222176Z %144 = tt.load %143 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1222465Z %145 = 
ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1222838Z ttg.local_store %144, %145 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1223082Z %146 = arith.addi %138, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1223282Z %147 = tt.addptr %9, %146 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1223502Z %148 = tt.load %147 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1223785Z %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1224140Z ttg.local_store %148, %149 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1224672Z %150:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:56:10.1225151Z %281 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1225379Z %282 = arith.addi %281, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1225559Z %283 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:56:10.1225679Z %284 = arith.muli %283, %c2_i32 : i32 2026-02-21T09:56:10.1225851Z %285 = tt.splat %284 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1226073Z %286 = arith.addi %285, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1226348Z %287 = tt.expand_dims %286 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, 
#blocked2> 2026-02-21T09:56:10.1226628Z %288 = tt.broadcast %287 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1226825Z %289 = arith.addi %138, %288 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1227030Z %290 = tt.addptr %9, %289 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1227237Z %291 = tt.load %290 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1227539Z %292 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1228005Z %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1228388Z %294 = tt.expand_dims %282 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1228652Z %295 = arith.muli %294, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1228847Z %296 = tt.broadcast %295 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1229038Z %297 = arith.addi %296, %140 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1229241Z %298 = tt.addptr %10, %297 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1229443Z %299 = tt.load %298 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1229692Z %300 = ttg.convert_layout %299 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1229979Z %301 = arith.shli %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1230216Z %302 = arith.shrsi %301, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1230454Z %303 = arith.shrsi %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:56:10.1230766Z %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1231102Z %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1231407Z %306 = tt.broadcast %304 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1231649Z %307 = arith.select %15, %306, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1231891Z %308 = tt.broadcast %305 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1232127Z %309 = arith.select %17, %308, %307 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1232359Z %310 = tt.reshape %309 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1232589Z %311 = arith.sitofp %310 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1232844Z %312 = ttg.local_alloc %311 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1233176Z %313 = ttg.local_load %312 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1233655Z %314 = tt.dot %293, %313, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1234006Z %315 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:10.1234135Z %316 = arith.cmpi slt, %315, %c2_i32 : i32 2026-02-21T09:56:10.1234268Z %317 = arith.select %316, %315, %c0_i32 : i32 2026-02-21T09:56:10.1234539Z %318 = ttg.memdesc_index %141[%317] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, 
mutable, 2x128x8> 2026-02-21T09:56:10.1234900Z ttg.local_store %291, %318 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1235297Z scf.yield %314, %317, %arg8, %318 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1235638Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:10.1235974Z %151 = ttg.local_load %150#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1236409Z %152 = arith.extf %151 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1236732Z %153 = arith.addi %66, %140 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1236933Z %154 = tt.addptr %10, %153 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1237135Z %155 = tt.load %154 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1237378Z %156 = ttg.convert_layout %155 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1237657Z %157 = arith.shli %156, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1237897Z %158 = arith.shrsi %157, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1238130Z %159 = arith.shrsi %156, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1238422Z %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1238774Z %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1239057Z %162 = tt.broadcast %160 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1239297Z %163 = arith.select %15, %162, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1239547Z %164 = tt.broadcast %161 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1239784Z %165 = arith.select %17, %164, %163 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1240017Z %166 = tt.reshape %165 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1240241Z %167 = arith.sitofp %166 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1240497Z %168 = ttg.local_alloc %167 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1240823Z %169 = ttg.local_load %168 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1241300Z %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1241803Z %171 = ttg.local_load %150#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1242234Z %172 = arith.extf %171 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1242533Z %173 = arith.addi %90, %140 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1242774Z %174 = tt.addptr %10, %173 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1242973Z %175 = tt.load %174 : tensor<4x128x!tt.ptr, 
#blocked1> 2026-02-21T09:56:10.1243220Z %176 = ttg.convert_layout %175 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1243501Z %177 = arith.shli %176, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1243739Z %178 = arith.shrsi %177, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1244000Z %179 = arith.shrsi %176, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1244286Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1244639Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1244922Z %182 = tt.broadcast %180 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1245162Z %183 = arith.select %15, %182, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1245401Z %184 = tt.broadcast %181 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1245633Z %185 = arith.select %17, %184, %183 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1245867Z %186 = tt.reshape %185 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1246088Z %187 = arith.sitofp %186 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1246340Z %188 = ttg.local_alloc %187 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1246669Z %189 = ttg.local_load %188 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1247152Z %190 = tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x8xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1247557Z ttg.local_dealloc %141 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1247775Z %191 = arith.truncf %190 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:10.1248044Z %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1248286Z %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1248515Z %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:10.1248776Z %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1248983Z %196 = tt.broadcast %194 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1249168Z %197 = arith.addi %195, %196 : tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1249362Z %198 = tt.addptr %18, %197 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1249560Z tt.store %198, %191 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.1249705Z %199 = arith.addi %arg3, %c2_i32 : i32 2026-02-21T09:56:10.1249826Z %200 = arith.divsi %199, %c256_i32 : i32 2026-02-21T09:56:10.1249946Z %201 = arith.muli %200, %c4_i32 : i32 2026-02-21T09:56:10.1250067Z %202 = arith.subi %c128_i32, %201 : i32 2026-02-21T09:56:10.1250185Z %203 = arith.minsi %202, %c4_i32 : i32 2026-02-21T09:56:10.1250306Z %204 = arith.remsi %199, %c256_i32 : i32 2026-02-21T09:56:10.1250421Z %205 = arith.remsi %204, %203 : i32 2026-02-21T09:56:10.1250538Z %206 = arith.addi %201, %205 : i32 2026-02-21T09:56:10.1250651Z %207 = arith.divsi %204, %203 : i32 2026-02-21T09:56:10.1250771Z %208 = arith.muli %206, %c128_i32 : i32 2026-02-21T09:56:10.1250943Z %209 = tt.splat %208 : i32 
-> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1251162Z %210 = tt.splat %208 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1251386Z %211 = arith.addi %209, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1251617Z %212 = arith.addi %210, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1251788Z %213 = arith.muli %207, %c128_i32 : i32 2026-02-21T09:56:10.1251954Z %214 = tt.splat %213 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1252191Z %215 = tt.splat %213 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1252410Z %216 = arith.addi %214, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1252621Z %217 = arith.addi %215, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1252901Z %218 = tt.expand_dims %211 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1253156Z %219 = arith.muli %218, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1253357Z %220 = tt.broadcast %219 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1253640Z %221 = tt.expand_dims %216 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:10.1253919Z %222 = tt.broadcast %221 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1254142Z %223 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1254342Z %224 = arith.addi %220, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1254541Z %225 = tt.addptr %9, %224 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1254747Z %226 = tt.load %225 : tensor<128x8x!tt.ptr, #blocked2> 
2026-02-21T09:56:10.1255051Z %227 = ttg.memdesc_index %223[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1255414Z ttg.local_store %226, %227 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1255654Z %228 = arith.addi %220, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1255854Z %229 = tt.addptr %9, %228 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1256060Z %230 = tt.load %229 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1256342Z %231 = ttg.memdesc_index %223[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1256704Z ttg.local_store %230, %231 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1257229Z %232:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %227, %arg8 = %231) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:56:10.1257704Z %281 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1257938Z %282 = arith.addi %281, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1258116Z %283 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:56:10.1258242Z %284 = arith.muli %283, %c2_i32 : i32 2026-02-21T09:56:10.1258412Z %285 = tt.splat %284 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1258630Z %286 = arith.addi %285, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1258911Z %287 = tt.expand_dims %286 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = 
#blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.1259187Z %288 = tt.broadcast %287 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1267175Z %289 = arith.addi %220, %288 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1267404Z %290 = tt.addptr %9, %289 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1267616Z %291 = tt.load %290 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1267965Z %292 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1268407Z %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1268793Z %294 = tt.expand_dims %282 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1269042Z %295 = arith.muli %294, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1269239Z %296 = tt.broadcast %295 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1269431Z %297 = arith.addi %296, %222 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1269634Z %298 = tt.addptr %10, %297 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1269844Z %299 = tt.load %298 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1270109Z %300 = ttg.convert_layout %299 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1270392Z %301 = arith.shli %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1270627Z %302 = arith.shrsi %301, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1270883Z %303 = arith.shrsi %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, 
parent = #blocked}>> 2026-02-21T09:56:10.1271175Z %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1271512Z %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1271802Z %306 = tt.broadcast %304 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1272045Z %307 = arith.select %15, %306, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1272288Z %308 = tt.broadcast %305 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1272524Z %309 = arith.select %17, %308, %307 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1272755Z %310 = tt.reshape %309 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1272980Z %311 = arith.sitofp %310 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1273234Z %312 = ttg.local_alloc %311 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1273566Z %313 = ttg.local_load %312 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1274050Z %314 = tt.dot %293, %313, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1274402Z %315 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:10.1274532Z %316 = arith.cmpi slt, %315, %c2_i32 : i32 2026-02-21T09:56:10.1274671Z %317 = arith.select %316, %315, %c0_i32 : i32 2026-02-21T09:56:10.1274938Z %318 = ttg.memdesc_index %223[%317] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, 
#shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1275317Z ttg.local_store %291, %318 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1275715Z scf.yield %314, %317, %arg8, %318 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1276073Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:10.1276393Z %233 = ttg.local_load %232#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1276824Z %234 = arith.extf %233 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1277122Z %235 = arith.addi %66, %222 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1277328Z %236 = tt.addptr %10, %235 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1277527Z %237 = tt.load %236 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1277773Z %238 = ttg.convert_layout %237 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1278054Z %239 = arith.shli %238, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1278305Z %240 = arith.shrsi %239, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1278539Z %241 = arith.shrsi %238, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1278845Z %242 = tt.expand_dims %240 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1279183Z %243 = tt.expand_dims %241 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1279467Z %244 = tt.broadcast %242 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1279709Z %245 = arith.select %15, %244, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1279951Z %246 = tt.broadcast %243 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1280187Z %247 = arith.select %17, %246, %245 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1280423Z %248 = tt.reshape %247 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1280649Z %249 = arith.sitofp %248 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1280908Z %250 = ttg.local_alloc %249 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1281238Z %251 = ttg.local_load %250 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1281712Z %252 = tt.dot %234, %251, %232#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1282215Z %253 = ttg.local_load %232#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1282693Z %254 = arith.extf %253 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1282992Z %255 = arith.addi %90, %222 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1283195Z %256 = tt.addptr %10, %255 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1283416Z %257 = tt.load %256 : 
tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1283659Z %258 = ttg.convert_layout %257 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1283956Z %259 = arith.shli %258, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1284189Z %260 = arith.shrsi %259, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1284426Z %261 = arith.shrsi %258, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1284713Z %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1285051Z %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1285338Z %264 = tt.broadcast %262 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1285575Z %265 = arith.select %15, %264, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1285813Z %266 = tt.broadcast %263 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1286045Z %267 = arith.select %17, %266, %265 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1286291Z %268 = tt.reshape %267 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1286518Z %269 = arith.sitofp %268 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1286770Z %270 = ttg.local_alloc %269 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1287114Z %271 = ttg.local_load %270 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1287585Z %272 = tt.dot %254, %271, %252, inputPrecision = tf32 : 
tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1287972Z ttg.local_dealloc %223 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1288189Z %273 = arith.truncf %272 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:10.1288459Z %274 = tt.expand_dims %212 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1288698Z %275 = arith.muli %274, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1288931Z %276 = tt.expand_dims %217 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:10.1289191Z %277 = tt.broadcast %275 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1289399Z %278 = tt.broadcast %276 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1289580Z %279 = arith.addi %277, %278 : tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1289773Z %280 = tt.addptr %18, %279 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1289974Z tt.store %280, %273 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.1290108Z } 2026-02-21T09:56:10.1290207Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:56:10.1290344Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.1290468Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:10.1290583Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:10.1290700Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:10.1290820Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.1290939Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:10.1291068Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:10.1291179Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:10.1291292Z %31 = arith.muli %29, %c128_i32 : i32 
2026-02-21T09:56:10.1291459Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1291691Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1291904Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.1292118Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.1292283Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:10.1292449Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1292661Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1292872Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.1293083Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.1293352Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1293608Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.1293816Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1294093Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:10.1294370Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1294610Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1294882Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.1295149Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, 
#blocked2> 2026-02-21T09:56:10.1295338Z %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1295537Z %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1295736Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1296019Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1296383Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1296650Z %53 = arith.addi %8, %cst_5 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1296925Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.1297195Z %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1297383Z %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1297577Z %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1297775Z %58 = tt.load %57 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1298053Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1298406Z ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1298928Z %60:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>) : i32 { 2026-02-21T09:56:10.1299436Z %117 = tt.splat 
%arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1299663Z %118 = arith.addi %117, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1299841Z %119 = arith.addi %arg4, %c8_i32 : i32 2026-02-21T09:56:10.1299966Z %120 = arith.muli %119, %c2_i32 : i32 2026-02-21T09:56:10.1300134Z %121 = tt.splat %120 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1300358Z %122 = arith.addi %121, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.1300630Z %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.1300908Z %124 = tt.broadcast %123 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1301105Z %125 = arith.addi %43, %124 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1301306Z %126 = tt.addptr %9, %125 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.1301513Z %127 = tt.load %126 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.1301832Z %128 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1302285Z %129 = arith.extf %128 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1302667Z %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1302914Z %131 = arith.muli %130, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1303110Z %132 = tt.broadcast %131 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1303302Z %133 = arith.addi %132, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1303505Z 
%134 = tt.addptr %10, %133 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1303711Z %135 = tt.load %134 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1303955Z %136 = ttg.convert_layout %135 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1304238Z %137 = arith.shli %136, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1304473Z %138 = arith.shrsi %137, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1304710Z %139 = arith.shrsi %136, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1305000Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1305336Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1305627Z %142 = tt.broadcast %140 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1305869Z %143 = arith.select %15, %142, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1306111Z %144 = tt.broadcast %141 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1306348Z %145 = arith.select %17, %144, %143 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1306596Z %146 = tt.reshape %145 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1306822Z %147 = arith.sitofp %146 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1307076Z %148 = ttg.local_alloc %147 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1307421Z %149 = ttg.local_load %148 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1307900Z %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1308246Z %151 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:10.1308379Z %152 = arith.cmpi slt, %151, %c2_i32 : i32 2026-02-21T09:56:10.1308513Z %153 = arith.select %152, %151, %c0_i32 : i32 2026-02-21T09:56:10.1308780Z %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1309139Z ttg.local_store %127, %154 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1309552Z scf.yield %150, %153, %arg8, %154 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> 2026-02-21T09:56:10.1309891Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:10.1310105Z %61 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1310446Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1310874Z %63 = arith.extf %62 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1311252Z %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1311495Z %65 = arith.muli %64, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1311685Z %66 = tt.broadcast %65 : tensor<4x1xi32, 
#blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1311873Z %67 = arith.addi %66, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1312065Z %68 = tt.addptr %10, %67 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1312261Z %69 = tt.load %68 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1312499Z %70 = ttg.convert_layout %69 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1312773Z %71 = arith.shli %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1312998Z %72 = arith.shrsi %71, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1313228Z %73 = arith.shrsi %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1313509Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1313838Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1314116Z %76 = tt.broadcast %74 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1314349Z %77 = arith.select %15, %76, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1314597Z %78 = tt.broadcast %75 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1314821Z %79 = arith.select %17, %78, %77 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1315040Z %80 = tt.reshape %79 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1315273Z %81 = arith.sitofp %80 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1315517Z %82 = ttg.local_alloc %81 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, 
#smem> 2026-02-21T09:56:10.1315837Z %83 = ttg.local_load %82 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1316304Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1316691Z %85 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.1317014Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1317438Z %87 = arith.extf %86 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1317825Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1318066Z %89 = arith.muli %88, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.1318268Z %90 = tt.broadcast %89 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1318456Z %91 = arith.addi %90, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1318645Z %92 = tt.addptr %10, %91 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.1318839Z %93 = tt.load %92 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.1319075Z %94 = ttg.convert_layout %93 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1319346Z %95 = arith.shli %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1319575Z %96 = arith.shrsi %95, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 
2026-02-21T09:56:10.1319799Z %97 = arith.shrsi %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.1320080Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1320409Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.1320687Z %100 = tt.broadcast %98 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1320925Z %101 = arith.select %15, %100, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1321161Z %102 = tt.broadcast %99 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1321389Z %103 = arith.select %17, %102, %101 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.1321622Z %104 = tt.reshape %103 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.1321842Z %105 = arith.sitofp %104 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.1322095Z %106 = ttg.local_alloc %105 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.1322418Z %107 = ttg.local_load %106 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.1322931Z %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.1323328Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.1323543Z %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:10.1323817Z %110 = tt.expand_dims %35 
{axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1324058Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.1324289Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:10.1324553Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1324761Z %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1324944Z %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1325140Z %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:10.1325339Z tt.store %116, %109 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.1325502Z } {tt.num_stages = 1 : i32} 2026-02-21T09:56:10.1325602Z tt.return 2026-02-21T09:56:10.1325682Z } 2026-02-21T09:56:10.1325759Z } 2026-02-21T09:56:10.1325805Z 2026-02-21T09:56:10.1325835Z {-# 2026-02-21T09:56:10.1325912Z external_resources: { 2026-02-21T09:56:10.1326010Z mlir_reproducer: { 2026-02-21T09:56:10.1327018Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:56:10.1328003Z disable_threading: false, 2026-02-21T09:56:10.1328106Z verify_each: true 2026-02-21T09:56:10.1328195Z } 2026-02-21T09:56:10.1328263Z } 2026-02-21T09:56:10.1328331Z #-} 
2026-02-21T09:56:10.1328605Z /tmp/torchinductor_root/sm/csmwdsu7lmd7uxkmcghpo2ohq4qfl5uu3uhxadnjsecl6z57rpnl.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:10.1329304Z /tmp/torchinductor_root/sm/csmwdsu7lmd7uxkmcghpo2ohq4qfl5uu3uhxadnjsecl6z57rpnl.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:10.1329858Z [700s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:10.1330628Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:10.1331329Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:10.1331496Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:10.9030045Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:10.9036101Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T09:56:10.9037028Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:56:10.9037665Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:56:10.9038273Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:56:10.9038853Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:56:10.9039374Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}> 2026-02-21T09:56:10.9039864Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:56:10.9040233Z #smem = #ttg.shared_memory 2026-02-21T09:56:10.9040705Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:56:10.9041743Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:56:10.9042524Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.9042952Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.9043409Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.9043780Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:56:10.9044116Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:56:10.9044499Z %cst_3 = arith.constant dense<508> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9044952Z 
%cst_4 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9045324Z %cst_5 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.9045632Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:56:10.9045846Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:56:10.9046023Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:56:10.9046210Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:56:10.9046456Z %cst_6 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9046688Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:56:10.9046874Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:56:10.9047052Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:56:10.9047338Z %cst_7 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9047644Z %0 = tt.get_program_id x : i32 2026-02-21T09:56:10.9047830Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:56:10.9048010Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:56:10.9048345Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.9048795Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.9049233Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.9049676Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.9050104Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9050529Z %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.9050992Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.9051316Z %10 = tt.splat 
%arg1 : !tt.ptr -> tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.9051793Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:56:10.9052468Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:56:10.9053120Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.9053530Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.9053840Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:56:10.9054165Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:10.9054471Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked> 2026-02-21T09:56:10.9054812Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.9055067Z %19 = arith.subi %2, %0 : i32 2026-02-21T09:56:10.9055238Z %20 = arith.remsi %19, %c2_i32 : i32 2026-02-21T09:56:10.9055452Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:56:10.9055630Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:56:10.9055853Z scf.for %arg3 = %0 to %22 step %c2_i32 : i32 { 2026-02-21T09:56:10.9056022Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.9056167Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:10.9056339Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:10.9056476Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:10.9056624Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.9056768Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:10.9056908Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:10.9057036Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:10.9057178Z %31 = 
arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:10.9057386Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.9057649Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.9057911Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.9058169Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.9058369Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:10.9058570Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.9058830Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.9059088Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.9059345Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.9059684Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.9059997Z %42 = arith.muli %41, %cst_5 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.9060239Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9060586Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:10.9060924Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9061220Z %46 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.9061583Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.9061934Z %48 = tt.broadcast %47 : tensor<1x8xi32, 
#blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9062168Z %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9062408Z %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9062661Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.9063014Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9063463Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9064002Z %53:3 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c0_i32, %arg7 = %52) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>) : i32 { 2026-02-21T09:56:10.9064486Z %144 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9064765Z %145 = arith.addi %144, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9065004Z %146 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:10.9065153Z %147 = arith.muli %146, %c2_i32 : i32 2026-02-21T09:56:10.9065361Z %148 = tt.splat %147 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.9065644Z %149 = arith.addi %148, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.9065923Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.9066204Z %151 = tt.broadcast %150 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9066399Z %152 = arith.addi %43, %151 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9066606Z %153 = tt.addptr %9, %152 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 
2026-02-21T09:56:10.9066814Z %154 = tt.load %153 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.9067123Z %155 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9067572Z %156 = arith.extf %155 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9067961Z %157 = tt.expand_dims %145 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9068216Z %158 = arith.muli %157, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9068412Z %159 = tt.broadcast %158 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9068610Z %160 = arith.addi %159, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9068812Z %161 = tt.addptr %10, %160 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9069017Z %162 = tt.load %161 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.9069286Z %163 = ttg.convert_layout %162 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9069571Z %164 = arith.shli %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9069814Z %165 = arith.shrsi %164, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9070068Z %166 = arith.shrsi %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9070363Z %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9070720Z %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 
2026-02-21T09:56:10.9071009Z %169 = tt.broadcast %167 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9071256Z %170 = arith.select %15, %169, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9071498Z %171 = tt.broadcast %168 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9071737Z %172 = arith.select %17, %171, %170 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9071973Z %173 = tt.reshape %172 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.9072199Z %174 = arith.sitofp %173 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.9072458Z %175 = ttg.local_alloc %174 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.9072809Z %176 = ttg.local_load %175 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9073292Z %177 = tt.dot %156, %176, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.9073676Z %178 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:10.9073805Z %179 = arith.cmpi slt, %178, %c1_i32 : i32 2026-02-21T09:56:10.9073944Z %180 = arith.select %179, %178, %c0_i32 : i32 2026-02-21T09:56:10.9074215Z %181 = ttg.memdesc_index %46[%180] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9074576Z ttg.local_store %154, %181 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9074894Z scf.yield %177, %180, %181 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 
2026-02-21T09:56:10.9075137Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:56:10.9075338Z %54 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9075670Z %55 = ttg.local_load %53#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9076100Z %56 = arith.extf %55 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9076484Z %57 = tt.expand_dims %54 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9076726Z %58 = arith.muli %57, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9076918Z %59 = tt.broadcast %58 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9077112Z %60 = arith.addi %59, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9077304Z %61 = tt.addptr %10, %60 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9077504Z %62 = tt.load %61 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.9077743Z %63 = ttg.convert_layout %62 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9078040Z %64 = arith.shli %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9078269Z %65 = arith.shrsi %64, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9078500Z %66 = arith.shrsi %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9078804Z %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9079135Z %68 = tt.expand_dims %66 {axis = 1 : i32} : 
tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9079417Z %69 = tt.broadcast %67 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9079657Z %70 = arith.select %15, %69, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9079889Z %71 = tt.broadcast %68 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9080121Z %72 = arith.select %17, %71, %70 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9080345Z %73 = tt.reshape %72 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.9080570Z %74 = arith.sitofp %73 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.9080825Z %75 = ttg.local_alloc %74 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.9081164Z %76 = ttg.local_load %75 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9081655Z %77 = tt.dot %56, %76, %53#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.9082046Z ttg.local_dealloc %46 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.9082261Z %78 = arith.truncf %77 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:10.9082536Z %79 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:10.9082822Z %80 = arith.muli %79, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.9083059Z %81 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:10.9083326Z %82 = 
tt.broadcast %80 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9083530Z %83 = tt.broadcast %81 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9083717Z %84 = arith.addi %82, %83 : tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9083906Z %85 = tt.addptr %18, %84 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9084113Z tt.store %85, %78 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.9084259Z %86 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:56:10.9084389Z %87 = arith.divsi %86, %c256_i32 : i32 2026-02-21T09:56:10.9084516Z %88 = arith.muli %87, %c4_i32 : i32 2026-02-21T09:56:10.9084637Z %89 = arith.subi %c128_i32, %88 : i32 2026-02-21T09:56:10.9084762Z %90 = arith.minsi %89, %c4_i32 : i32 2026-02-21T09:56:10.9084883Z %91 = arith.remsi %86, %c256_i32 : i32 2026-02-21T09:56:10.9085006Z %92 = arith.remsi %91, %90 : i32 2026-02-21T09:56:10.9085122Z %93 = arith.addi %88, %92 : i32 2026-02-21T09:56:10.9085241Z %94 = arith.divsi %91, %90 : i32 2026-02-21T09:56:10.9085356Z %95 = arith.muli %93, %c128_i32 : i32 2026-02-21T09:56:10.9085529Z %96 = tt.splat %95 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.9085750Z %97 = tt.splat %95 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.9085990Z %98 = arith.addi %96, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.9086210Z %99 = arith.addi %97, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.9086395Z %100 = arith.muli %94, %c128_i32 : i32 2026-02-21T09:56:10.9086576Z %101 = tt.splat %100 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.9086803Z %102 = tt.splat %100 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.9087026Z %103 = arith.addi %101, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 
2026-02-21T09:56:10.9087247Z %104 = arith.addi %102, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.9087528Z %105 = tt.expand_dims %98 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.9087794Z %106 = arith.muli %105, %cst_5 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.9088000Z %107 = tt.broadcast %106 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9088288Z %108 = tt.expand_dims %103 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:10.9088582Z %109 = tt.broadcast %108 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9088826Z %110 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.9089021Z %111 = arith.addi %107, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9089229Z %112 = tt.addptr %9, %111 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9089458Z %113 = tt.load %112 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.9089754Z %114 = ttg.memdesc_index %110[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9090127Z ttg.local_store %113, %114 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9090574Z %115:3 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c0_i32, %arg7 = %114) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>) : i32 { 2026-02-21T09:56:10.9090968Z %144 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9091200Z %145 = arith.addi %144, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:56:10.9091383Z %146 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:10.9091509Z %147 = arith.muli %146, %c2_i32 : i32 2026-02-21T09:56:10.9091686Z %148 = tt.splat %147 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.9091915Z %149 = arith.addi %148, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.9092193Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.9092479Z %151 = tt.broadcast %150 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9092679Z %152 = arith.addi %107, %151 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9092892Z %153 = tt.addptr %9, %152 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9093107Z %154 = tt.load %153 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.9093418Z %155 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9093874Z %156 = arith.extf %155 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9094286Z %157 = tt.expand_dims %145 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9094564Z %158 = arith.muli %157, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9094765Z %159 = tt.broadcast %158 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9094962Z %160 = arith.addi %159, %109 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9095172Z %161 = tt.addptr %10, %160 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9095379Z %162 = tt.load %161 : tensor<4x128x!tt.ptr, #blocked1> 
2026-02-21T09:56:10.9095635Z %163 = ttg.convert_layout %162 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9095928Z %164 = arith.shli %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9096167Z %165 = arith.shrsi %164, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9096411Z %166 = arith.shrsi %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9096711Z %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9097076Z %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9097370Z %169 = tt.broadcast %167 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9097631Z %170 = arith.select %15, %169, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9097882Z %171 = tt.broadcast %168 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9098121Z %172 = arith.select %17, %171, %170 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9098360Z %173 = tt.reshape %172 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.9098596Z %174 = arith.sitofp %173 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.9098856Z %175 = ttg.local_alloc %174 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.9099196Z %176 = ttg.local_load %175 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9099684Z %177 = tt.dot %156, %176, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 
0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.9100037Z %178 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:10.9100177Z %179 = arith.cmpi slt, %178, %c1_i32 : i32 2026-02-21T09:56:10.9100314Z %180 = arith.select %179, %178, %c0_i32 : i32 2026-02-21T09:56:10.9100593Z %181 = ttg.memdesc_index %110[%180] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9100965Z ttg.local_store %154, %181 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9101284Z scf.yield %177, %180, %181 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9101533Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:56:10.9101843Z %116 = ttg.local_load %115#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9102298Z %117 = arith.extf %116 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9102606Z %118 = arith.addi %59, %109 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9102825Z %119 = tt.addptr %10, %118 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9103036Z %120 = tt.load %119 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.9103283Z %121 = ttg.convert_layout %120 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9103572Z %122 = arith.shli %121, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9103818Z %123 = arith.shrsi %122, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:56:10.9104060Z %124 = arith.shrsi %121, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9104358Z %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9104704Z %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9105002Z %127 = tt.broadcast %125 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9105273Z %128 = arith.select %15, %127, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9105515Z %129 = tt.broadcast %126 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9105771Z %130 = arith.select %17, %129, %128 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9106009Z %131 = tt.reshape %130 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.9106238Z %132 = arith.sitofp %131 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.9106499Z %133 = ttg.local_alloc %132 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.9106828Z %134 = ttg.local_load %133 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9107310Z %135 = tt.dot %117, %134, %115#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.9107707Z ttg.local_dealloc %110 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.9107928Z %136 = arith.truncf %135 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:10.9108210Z 
%137 = tt.expand_dims %99 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:10.9108452Z %138 = arith.muli %137, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.9108691Z %139 = tt.expand_dims %104 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:10.9108961Z %140 = tt.broadcast %138 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9109171Z %141 = tt.broadcast %139 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9109360Z %142 = arith.addi %140, %141 : tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9109555Z %143 = tt.addptr %18, %142 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9109766Z tt.store %143, %136 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.9109907Z } 2026-02-21T09:56:10.9110005Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:56:10.9110167Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.9110292Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:10.9110417Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:10.9110538Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:10.9110680Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:56:10.9110801Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:10.9110919Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:10.9111037Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:10.9111152Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:10.9111325Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.9111541Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:10.9111765Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:10.9111983Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, 
parent = #mma}>> 2026-02-21T09:56:10.9112155Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:10.9112325Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.9112539Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.9112760Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:10.9112987Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:10.9113265Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.9113539Z %42 = arith.muli %41, %cst_5 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:10.9113733Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9114018Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:10.9114298Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9114522Z %46 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.9114798Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.9115071Z %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9115268Z %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9115468Z %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9115675Z %51 = tt.load %50 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.9115968Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, 
#shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9116326Z ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9116762Z %53:3 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c0_i32, %arg7 = %52) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>) : i32 { 2026-02-21T09:56:10.9117143Z %86 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9117375Z %87 = arith.addi %86, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9117556Z %88 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:10.9117678Z %89 = arith.muli %88, %c2_i32 : i32 2026-02-21T09:56:10.9117864Z %90 = tt.splat %89 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.9118077Z %91 = arith.addi %90, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:10.9118350Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2> 2026-02-21T09:56:10.9118638Z %93 = tt.broadcast %92 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9118828Z %94 = arith.addi %43, %93 : tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9119025Z %95 = tt.addptr %9, %94 : tensor<128x8x!tt.ptr, #blocked2>, tensor<128x8xi32, #blocked2> 2026-02-21T09:56:10.9119225Z %96 = tt.load %95 : tensor<128x8x!tt.ptr, #blocked2> 2026-02-21T09:56:10.9119529Z %97 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9119967Z %98 = arith.extf %97 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9120346Z %99 
= tt.expand_dims %87 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9120592Z %100 = arith.muli %99, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9120803Z %101 = tt.broadcast %100 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9120999Z %102 = arith.addi %101, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9121204Z %103 = tt.addptr %10, %102 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9121423Z %104 = tt.load %103 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.9121671Z %105 = ttg.convert_layout %104 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9121958Z %106 = arith.shli %105, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9122199Z %107 = arith.shrsi %106, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9122440Z %108 = arith.shrsi %105, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9122765Z %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9123109Z %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9123399Z %111 = tt.broadcast %109 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9123644Z %112 = arith.select %15, %111, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9123889Z %113 = tt.broadcast %110 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9124125Z %114 = arith.select %17, %113, %112 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9124360Z %115 = 
tt.reshape %114 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.9124585Z %116 = arith.sitofp %115 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.9124841Z %117 = ttg.local_alloc %116 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.9125173Z %118 = ttg.local_load %117 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9125651Z %119 = tt.dot %98, %118, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.9126027Z %120 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:10.9126157Z %121 = arith.cmpi slt, %120, %c1_i32 : i32 2026-02-21T09:56:10.9126290Z %122 = arith.select %121, %120, %c0_i32 : i32 2026-02-21T09:56:10.9126576Z %123 = ttg.memdesc_index %46[%122] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9126935Z ttg.local_store %96, %123 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9127251Z scf.yield %119, %122, %123 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> 2026-02-21T09:56:10.9127494Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T09:56:10.9127690Z %54 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:10.9128027Z %55 = ttg.local_load %53#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9128456Z %56 = arith.extf %55 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, 
#ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9128853Z %57 = tt.expand_dims %54 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9129098Z %58 = arith.muli %57, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T09:56:10.9129286Z %59 = tt.broadcast %58 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9129496Z %60 = arith.addi %59, %45 : tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9129687Z %61 = tt.addptr %10, %60 : tensor<4x128x!tt.ptr, #blocked1>, tensor<4x128xi32, #blocked1> 2026-02-21T09:56:10.9129888Z %62 = tt.load %61 : tensor<4x128x!tt.ptr, #blocked1> 2026-02-21T09:56:10.9130129Z %63 = ttg.convert_layout %62 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9130402Z %64 = arith.shli %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9130632Z %65 = arith.shrsi %64, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9130862Z %66 = arith.shrsi %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:10.9131145Z %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9131479Z %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked> 2026-02-21T09:56:10.9131755Z %69 = tt.broadcast %67 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9131991Z %70 = arith.select %15, %69, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9132220Z %71 = tt.broadcast %68 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9132449Z %72 = arith.select %17, %71, %70 : tensor<4x2x128xi1, #blocked>, 
tensor<4x2x128xi8, #blocked> 2026-02-21T09:56:10.9132674Z %73 = tt.reshape %72 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3> 2026-02-21T09:56:10.9132891Z %74 = arith.sitofp %73 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3> 2026-02-21T09:56:10.9133140Z %75 = ttg.local_alloc %74 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem> 2026-02-21T09:56:10.9133466Z %76 = ttg.local_load %75 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:10.9133951Z %77 = tt.dot %56, %76, %53#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:10.9134337Z ttg.local_dealloc %46 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> 2026-02-21T09:56:10.9134571Z %78 = arith.truncf %77 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:10.9134842Z %79 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:10.9135077Z %80 = arith.muli %79, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:10.9135302Z %81 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:10.9135563Z %82 = tt.broadcast %80 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9135767Z %83 = tt.broadcast %81 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9135947Z %84 = arith.addi %82, %83 : tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9136133Z %85 = tt.addptr %18, %84 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:10.9136330Z tt.store %85, %78 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:10.9136471Z } {tt.num_stages = 1 : i32} 2026-02-21T09:56:10.9136576Z 
tt.return 2026-02-21T09:56:10.9136660Z } 2026-02-21T09:56:10.9136734Z } 2026-02-21T09:56:10.9136794Z 2026-02-21T09:56:10.9136826Z {-# 2026-02-21T09:56:10.9136903Z external_resources: { 2026-02-21T09:56:10.9137003Z mlir_reproducer: { 2026-02-21T09:56:10.9138028Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:56:10.9139018Z disable_threading: false, 2026-02-21T09:56:10.9139124Z verify_each: true 2026-02-21T09:56:10.9139215Z } 2026-02-21T09:56:10.9139285Z } 2026-02-21T09:56:10.9139355Z #-} 2026-02-21T09:56:10.9139629Z /tmp/torchinductor_root/lj/cljgqngrf6gusqnmjzzg3742xw7s5ov2mbua3vqk27mj2ej4ybxp.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:56:10.9140309Z /tmp/torchinductor_root/lj/cljgqngrf6gusqnmjzzg3742xw7s5ov2mbua3vqk27mj2ej4ybxp.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:56:10.9140856Z [701s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:56:10.9141632Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[0, 2], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:56:10.9142336Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:56:10.9142505Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:56:11.0128334Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:56:11.0133929Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:56:11.0134856Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:56:11.0135672Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:56:11.0135968Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:56:11.0136251Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:56:11.0136504Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:56:11.0136740Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:56:11.0136922Z #smem = #ttg.shared_memory 2026-02-21T09:56:11.0137157Z module attributes {"ttg.num-ctas" = 1 : i32, 
"ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:56:11.0137628Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:56:11.0138000Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.0138203Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.0138375Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.0138558Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0138770Z %cst_3 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0138951Z %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.0139107Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:56:11.0139224Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:56:11.0139341Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:56:11.0139455Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:56:11.0139606Z %cst_5 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0139750Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:56:11.0139865Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:56:11.0139982Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:56:11.0140090Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:56:11.0140269Z %cst_6 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0140456Z %0 = tt.get_program_id x : i32 2026-02-21T09:56:11.0140573Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:56:11.0140687Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:56:11.0140890Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.0141167Z %4 = 
tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.0141438Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.0141708Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.0141971Z %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0142235Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0142476Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.0142676Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.0142969Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:56:11.0143381Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:56:11.0143798Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.0144053Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.0144247Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:56:11.0144446Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.0144633Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:56:11.0144844Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.0145002Z %19 = arith.subi %2, %0 : i32 
2026-02-21T09:56:11.0145111Z %20 = arith.remsi %19, %c2_i32 : i32 2026-02-21T09:56:11.0145227Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:56:11.0145338Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:56:11.0145461Z scf.for %arg3 = %0 to %22 step %c2_i32 : i32 { 2026-02-21T09:56:11.0145597Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:11.0145737Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:11.0145856Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:11.0145971Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:11.0146091Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:56:11.0146225Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:11.0146339Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:11.0146448Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:11.0146563Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:11.0146753Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.0146967Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.0147179Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.0147393Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.0147554Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:11.0147721Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.0147928Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.0148140Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.0148351Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.0148621Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 
2026-02-21T09:56:11.0148875Z %42 = arith.muli %41, %cst_4 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.0149067Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0149349Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.0149627Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0149893Z %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:56:11.0150162Z %88 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0150413Z %89 = arith.addi %88, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0150587Z %90 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:56:11.0150754Z %91 = tt.splat %90 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0150983Z %92 = arith.addi %91, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0151254Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.0151525Z %94 = tt.broadcast %93 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0151719Z %95 = arith.addi %43, %94 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0151920Z %96 = tt.addptr %9, %95 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0152123Z %97 = tt.load %96 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.0152347Z %98 = ttg.local_alloc %97 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:56:11.0152675Z %99 = ttg.local_load %98 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = 
#mma, kWidth = 2}>> 2026-02-21T09:56:11.0153107Z %100 = arith.extf %99 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0153494Z %101 = tt.expand_dims %89 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0153740Z %102 = arith.muli %101, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0153954Z %103 = tt.broadcast %102 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0154152Z %104 = arith.addi %103, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0154353Z %105 = tt.addptr %10, %104 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0154559Z %106 = tt.load %105 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.0154805Z %107 = ttg.convert_layout %106 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0155093Z %108 = arith.shli %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0155331Z %109 = arith.shrsi %108, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0155571Z %110 = arith.shrsi %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0155865Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0156206Z %112 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0156497Z %113 = tt.broadcast %111 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0156747Z %114 = arith.select %15, %113, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0156990Z 
%115 = tt.broadcast %112 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0157232Z %116 = arith.select %17, %115, %114 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0157463Z %117 = tt.reshape %116 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.0157695Z %118 = arith.sitofp %117 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.0157952Z %119 = ttg.local_alloc %118 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.0158308Z %120 = ttg.local_load %119 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0158792Z %121 = tt.dot %100, %120, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0159163Z %122 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:56:11.0159338Z %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0159565Z %124 = arith.addi %123, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0159739Z %125 = arith.muli %122, %c2_i32 : i32 2026-02-21T09:56:11.0159910Z %126 = tt.splat %125 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0160132Z %127 = arith.addi %126, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0160410Z %128 = tt.expand_dims %127 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.0160692Z %129 = tt.broadcast %128 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0160889Z %130 = arith.addi %43, %129 : tensor<128x4xi32, #blocked2> 
2026-02-21T09:56:11.0169004Z %131 = tt.addptr %9, %130 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0169221Z %132 = tt.load %131 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.0169452Z %133 = ttg.local_alloc %132 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:56:11.0169810Z %134 = ttg.local_load %133 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0170228Z %135 = arith.extf %134 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0170620Z %136 = tt.expand_dims %124 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0170875Z %137 = arith.muli %136, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0171079Z %138 = tt.broadcast %137 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0171277Z %139 = arith.addi %138, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0171478Z %140 = tt.addptr %10, %139 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0171686Z %141 = tt.load %140 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.0171930Z %142 = ttg.convert_layout %141 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0172215Z %143 = arith.shli %142, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0172454Z %144 = arith.shrsi %143, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0172691Z %145 = arith.shrsi %142, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0172984Z %146 = tt.expand_dims %144 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 
1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0173322Z %147 = tt.expand_dims %145 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0173613Z %148 = tt.broadcast %146 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0173860Z %149 = arith.select %15, %148, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0174122Z %150 = tt.broadcast %147 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0174360Z %151 = arith.select %17, %150, %149 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0174607Z %152 = tt.reshape %151 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.0174836Z %153 = arith.sitofp %152 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.0175095Z %154 = ttg.local_alloc %153 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.0175423Z %155 = ttg.local_load %154 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0175906Z %156 = tt.dot %135, %155, %121, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0176270Z scf.yield %156 : tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0176406Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:56:11.0176581Z %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.0176852Z %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.0177111Z %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma> 
2026-02-21T09:56:11.0177337Z %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.0177613Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0177818Z %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0177996Z %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0178189Z %54 = tt.addptr %18, %53 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0178383Z tt.store %54, %47 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.0178527Z %55 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:56:11.0178652Z %56 = arith.divsi %55, %c256_i32 : i32 2026-02-21T09:56:11.0178769Z %57 = arith.muli %56, %c4_i32 : i32 2026-02-21T09:56:11.0178890Z %58 = arith.subi %c128_i32, %57 : i32 2026-02-21T09:56:11.0179004Z %59 = arith.minsi %58, %c4_i32 : i32 2026-02-21T09:56:11.0179122Z %60 = arith.remsi %55, %c256_i32 : i32 2026-02-21T09:56:11.0179236Z %61 = arith.remsi %60, %59 : i32 2026-02-21T09:56:11.0179353Z %62 = arith.addi %57, %61 : i32 2026-02-21T09:56:11.0179464Z %63 = arith.divsi %60, %59 : i32 2026-02-21T09:56:11.0179577Z %64 = arith.muli %62, %c128_i32 : i32 2026-02-21T09:56:11.0179744Z %65 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.0179958Z %66 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.0180172Z %67 = arith.addi %65, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.0180383Z %68 = arith.addi %66, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.0180547Z %69 = arith.muli %63, %c128_i32 : i32 2026-02-21T09:56:11.0180712Z %70 = tt.splat %69 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.0180921Z %71 = tt.splat %69 : i32 -> 
tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.0181133Z %72 = arith.addi %70, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.0181342Z %73 = arith.addi %71, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.0181630Z %74 = tt.expand_dims %67 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.0181880Z %75 = arith.muli %74, %cst_4 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.0182074Z %76 = tt.broadcast %75 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0182367Z %77 = tt.expand_dims %72 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.0182645Z %78 = tt.broadcast %77 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0182912Z %79 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:56:11.0183180Z %88 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0183402Z %89 = arith.addi %88, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0183580Z %90 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:56:11.0183745Z %91 = tt.splat %90 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0183960Z %92 = arith.addi %91, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0184232Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.0184527Z %94 = tt.broadcast %93 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0184720Z %95 = arith.addi %76, %94 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0184915Z %96 = 
tt.addptr %9, %95 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0185136Z %97 = tt.load %96 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.0185353Z %98 = ttg.local_alloc %97 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:56:11.0185684Z %99 = ttg.local_load %98 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0186095Z %100 = arith.extf %99 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0186479Z %101 = tt.expand_dims %89 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0186732Z %102 = arith.muli %101, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0186928Z %103 = tt.broadcast %102 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0187125Z %104 = arith.addi %103, %78 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0187328Z %105 = tt.addptr %10, %104 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0187534Z %106 = tt.load %105 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.0187780Z %107 = ttg.convert_layout %106 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0188066Z %108 = arith.shli %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0188306Z %109 = arith.shrsi %108, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0188544Z %110 = arith.shrsi %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0188832Z %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, 
#blocked> 2026-02-21T09:56:11.0189171Z %112 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0189477Z %113 = tt.broadcast %111 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0189717Z %114 = arith.select %15, %113, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0189975Z %115 = tt.broadcast %112 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0190210Z %116 = arith.select %17, %115, %114 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0190446Z %117 = tt.reshape %116 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.0190673Z %118 = arith.sitofp %117 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.0190927Z %119 = ttg.local_alloc %118 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.0191257Z %120 = ttg.local_load %119 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0191742Z %121 = tt.dot %100, %120, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0192096Z %122 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:56:11.0192287Z %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0192512Z %124 = arith.addi %123, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0192689Z %125 = arith.muli %122, %c2_i32 : i32 2026-02-21T09:56:11.0192885Z %126 = tt.splat %125 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0193108Z %127 = arith.addi %126, %8 : 
tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0193387Z %128 = tt.expand_dims %127 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.0193666Z %129 = tt.broadcast %128 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0193864Z %130 = arith.addi %76, %129 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0194066Z %131 = tt.addptr %9, %130 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0194278Z %132 = tt.load %131 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.0194506Z %133 = ttg.local_alloc %132 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:56:11.0194844Z %134 = ttg.local_load %133 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0195255Z %135 = arith.extf %134 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0195641Z %136 = tt.expand_dims %124 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0195894Z %137 = arith.muli %136, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0196090Z %138 = tt.broadcast %137 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0196284Z %139 = arith.addi %138, %78 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0196483Z %140 = tt.addptr %10, %139 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0196687Z %141 = tt.load %140 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.0196935Z %142 = ttg.convert_layout %141 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0197246Z %143 = arith.shli %142, 
%cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0197482Z %144 = arith.shrsi %143, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0197719Z %145 = arith.shrsi %142, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0198033Z %146 = tt.expand_dims %144 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0198376Z %147 = tt.expand_dims %145 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0198664Z %148 = tt.broadcast %146 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0198908Z %149 = arith.select %15, %148, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0199149Z %150 = tt.broadcast %147 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0199384Z %151 = arith.select %17, %150, %149 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0199620Z %152 = tt.reshape %151 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.0199850Z %153 = arith.sitofp %152 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.0200126Z %154 = ttg.local_alloc %153 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.0200458Z %155 = ttg.local_load %154 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0200949Z %156 = tt.dot %135, %155, %121, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0201302Z scf.yield %156 : tensor<128x128xf32, #mma> 
2026-02-21T09:56:11.0201439Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:56:11.0201607Z %80 = arith.truncf %79 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.0201878Z %81 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.0202115Z %82 = arith.muli %81, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.0202343Z %83 = tt.expand_dims %73 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.0202659Z %84 = tt.broadcast %82 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0202860Z %85 = tt.broadcast %83 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0203040Z %86 = arith.addi %84, %85 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0203227Z %87 = tt.addptr %18, %86 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0203423Z tt.store %87, %80 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.0203554Z } 2026-02-21T09:56:11.0203649Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:56:11.0203786Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:11.0203905Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:11.0204025Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:11.0204142Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:11.0204263Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:56:11.0204382Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:11.0204492Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:11.0204604Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:11.0204715Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:11.0204883Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.0205115Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 
2026-02-21T09:56:11.0205331Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.0205560Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.0205723Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:11.0205888Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.0206095Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.0206306Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.0206515Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.0206787Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.0207044Z %42 = arith.muli %41, %cst_4 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.0207236Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0207522Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.0207824Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0208093Z %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<128x128xf32, #mma>) : i32 { 2026-02-21T09:56:11.0208380Z %55 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0208601Z %56 = arith.addi %55, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0208780Z %57 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:56:11.0208945Z %58 = tt.splat %57 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0209162Z %59 = arith.addi 
%58, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0209434Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.0209705Z %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0209898Z %62 = arith.addi %43, %61 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0210097Z %63 = tt.addptr %9, %62 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0210302Z %64 = tt.load %63 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.0210521Z %65 = ttg.local_alloc %64 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:56:11.0210849Z %66 = ttg.local_load %65 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0211255Z %67 = arith.extf %66 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0211643Z %68 = tt.expand_dims %56 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0211888Z %69 = arith.muli %68, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0212075Z %70 = tt.broadcast %69 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0212266Z %71 = arith.addi %70, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0212457Z %72 = tt.addptr %10, %71 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0212671Z %73 = tt.load %72 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.0212911Z %74 = ttg.convert_layout %73 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0213201Z %75 = arith.shli %74, %cst_6 : 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0213432Z %76 = arith.shrsi %75, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0213664Z %77 = arith.shrsi %74, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0213948Z %78 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0214284Z %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0214583Z %80 = tt.broadcast %78 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0214817Z %81 = arith.select %15, %80, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0215051Z %82 = tt.broadcast %79 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0215278Z %83 = arith.select %17, %82, %81 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0215526Z %84 = tt.reshape %83 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.0215747Z %85 = arith.sitofp %84 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.0215994Z %86 = ttg.local_alloc %85 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.0216331Z %87 = ttg.local_load %86 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0216799Z %88 = tt.dot %67, %87, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0217145Z %89 = arith.addi %arg4, %c2_i32 : i32 2026-02-21T09:56:11.0217315Z %90 = 
tt.splat %89 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0217532Z %91 = arith.addi %90, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.0217703Z %92 = arith.muli %89, %c2_i32 : i32 2026-02-21T09:56:11.0217865Z %93 = tt.splat %92 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0218082Z %94 = arith.addi %93, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.0218352Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.0218621Z %96 = tt.broadcast %95 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0218811Z %97 = arith.addi %43, %96 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0219005Z %98 = tt.addptr %9, %97 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.0219206Z %99 = tt.load %98 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.0219426Z %100 = ttg.local_alloc %99 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem> 2026-02-21T09:56:11.0219756Z %101 = ttg.local_load %100 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0220166Z %102 = arith.extf %101 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0220564Z %103 = tt.expand_dims %91 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0220811Z %104 = arith.muli %103, %cst_3 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.0221025Z %105 = tt.broadcast %104 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0221218Z %106 = arith.addi %105, %45 : tensor<2x128xi32, 
#blocked1> 2026-02-21T09:56:11.0221420Z %107 = tt.addptr %10, %106 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.0221622Z %108 = tt.load %107 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.0221867Z %109 = ttg.convert_layout %108 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0222151Z %110 = arith.shli %109, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0222392Z %111 = arith.shrsi %110, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0222630Z %112 = arith.shrsi %109, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.0222920Z %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0223274Z %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.0223561Z %115 = tt.broadcast %113 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0223801Z %116 = arith.select %15, %115, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0224064Z %117 = tt.broadcast %114 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0224299Z %118 = arith.select %17, %117, %116 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.0224531Z %119 = tt.reshape %118 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.0224758Z %120 = arith.sitofp %119 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.0225011Z %121 = ttg.local_alloc %120 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.0225339Z %122 = ttg.local_load %121 : !ttg.memdesc<4x128xf32, 
#shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.0225816Z %123 = tt.dot %102, %122, %88, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0226164Z scf.yield %123 : tensor<128x128xf32, #mma> 2026-02-21T09:56:11.0226298Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:56:11.0226465Z %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.0226730Z %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.0226964Z %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.0227194Z %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.0227450Z %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0227650Z %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0227827Z %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0228012Z %54 = tt.addptr %18, %53 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.0228222Z tt.store %54, %47 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.0228359Z } {tt.num_stages = 1 : i32} 2026-02-21T09:56:11.0228460Z tt.return 2026-02-21T09:56:11.0228539Z } 2026-02-21T09:56:11.0228609Z } 2026-02-21T09:56:11.0228652Z 2026-02-21T09:56:11.0228699Z {-# 2026-02-21T09:56:11.0228775Z external_resources: { 2026-02-21T09:56:11.0228874Z mlir_reproducer: { 2026-02-21T09:56:11.0229871Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, 
convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:56:11.0230859Z disable_threading: false, 2026-02-21T09:56:11.0230963Z verify_each: true 2026-02-21T09:56:11.0231050Z } 2026-02-21T09:56:11.0231119Z } 2026-02-21T09:56:11.0231186Z #-} 2026-02-21T09:56:11.0231461Z /tmp/torchinductor_root/6o/c6ofv26dcuti2kka3crljighg33iusjmljdwlhhl27molm6wtgw6.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:56:11.0232165Z /tmp/torchinductor_root/6o/c6ofv26dcuti2kka3crljighg33iusjmljdwlhhl27molm6wtgw6.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:56:11.0232707Z [701s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:56:11.0233498Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:56:11.0234203Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:56:11.0234367Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:56:11.5973539Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 2026-02-21T09:56:11.5980977Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:56:11.5981327Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:56:11.5981638Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:56:11.5981930Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:56:11.5982212Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:56:11.5982466Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:56:11.5982697Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:56:11.5982876Z #smem = #ttg.shared_memory 2026-02-21T09:56:11.5983107Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:56:11.5983580Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:56:11.5984156Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.5984331Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.5984575Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.5984756Z %cst_2 = arith.constant dense<0.000000e+00> : 
tensor<128x128xf32, #mma> 2026-02-21T09:56:11.5984918Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:56:11.5985037Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:56:11.5985224Z %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.5985480Z %cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.5985727Z %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.5985943Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.5986118Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.5986268Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:56:11.5986386Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:56:11.5986500Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:56:11.5986647Z %cst_8 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.5986831Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:56:11.5986941Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:56:11.5987114Z %cst_9 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.5987301Z %0 = tt.get_program_id x : i32 2026-02-21T09:56:11.5987487Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:56:11.5987602Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:56:11.5987805Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.5988078Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.5988348Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.5988615Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = 
#mma}>> 2026-02-21T09:56:11.5988877Z %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.5989139Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.5989380Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.5989579Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.5989848Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:56:11.5990263Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:56:11.5990663Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.5990912Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.5991106Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:56:11.5991306Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.5991493Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:56:11.5991719Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.5991875Z %19 = arith.subi %2, %0 : i32 2026-02-21T09:56:11.5992005Z %20 = arith.remsi %19, %c2_i32 : i32 2026-02-21T09:56:11.5992148Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:56:11.5992256Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:56:11.5992378Z scf.for %arg3 = %0 to %22 step %c2_i32 : i32 { 2026-02-21T09:56:11.5992515Z %23 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:56:11.5992637Z %24 = arith.muli %23, %c2_i32 : i32 
2026-02-21T09:56:11.5992754Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:11.5992867Z %26 = arith.minsi %25, %c2_i32 : i32 2026-02-21T09:56:11.5992986Z %27 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:56:11.5993103Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:11.5993217Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:11.5993329Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:11.5993438Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:11.5993607Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.5993817Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.5994034Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.5994263Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.5994426Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:11.5994590Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.5994796Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.5995026Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.5995237Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.5995506Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.5995761Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.5995952Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.5996231Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.5996507Z %45 = 
tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.5996728Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.5996992Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.5997261Z %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.5997452Z %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.5997650Z %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.5997875Z %51 = tt.load %50 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.5998161Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.5998515Z ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.5998790Z %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.5999061Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.5999349Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.5999537Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.5999744Z %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.5999945Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6000222Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6000578Z ttg.local_store %58, %59 : 
tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6001105Z %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:11.6001583Z %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6001817Z %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6001993Z %201 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:11.6002113Z %202 = arith.muli %201, %c2_i32 : i32 2026-02-21T09:56:11.6002302Z %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.6002521Z %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.6002895Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.6003171Z %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6003369Z %207 = arith.addi %43, %206 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6003572Z %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6003780Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6004087Z %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6004526Z %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:56:11.6004915Z %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6005165Z %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6005360Z %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6005554Z %215 = arith.addi %214, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6005756Z %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6005957Z %217 = tt.load %216 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6006206Z %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6006488Z %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6006726Z %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6006963Z %221 = arith.shrsi %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6007271Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6007609Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6007914Z %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6008161Z %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6008404Z %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6008641Z %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 
2026-02-21T09:56:11.6008879Z %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6009107Z %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6009372Z %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6009710Z %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6010231Z %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6010589Z %233 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:11.6010720Z %234 = arith.cmpi slt, %233, %c2_i32 : i32 2026-02-21T09:56:11.6010881Z %235 = arith.select %234, %233, %c0_i32 : i32 2026-02-21T09:56:11.6011156Z %236 = ttg.memdesc_index %46[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6011522Z ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6011930Z scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6012280Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:11.6012495Z %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6012829Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6013258Z %63 
= arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6013646Z %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6013898Z %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6014089Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6014283Z %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6014479Z %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6014682Z %69 = tt.load %68 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6014924Z %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6015200Z %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6015451Z %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6015681Z %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6015986Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6016323Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6016603Z %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6016845Z %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6017080Z %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, 
#blocked> 2026-02-21T09:56:11.6017314Z %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6017543Z %80 = tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6017764Z %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6018019Z %82 = ttg.local_alloc %81 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6018359Z %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6018849Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6019244Z %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6019577Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6020011Z %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6020396Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6020641Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6020834Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6021023Z %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6021219Z %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 
2026-02-21T09:56:11.6021422Z %93 = tt.load %92 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6021661Z %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6021941Z %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6022172Z %96 = arith.shrsi %95, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6022407Z %97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6022695Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6023026Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6023309Z %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6023571Z %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6023813Z %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6024068Z %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6024303Z %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6024535Z %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6024790Z %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6025124Z %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6025600Z %108 = tt.dot %87, %107, 
%84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6025983Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.6026206Z %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.6026501Z %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.6026743Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.6026996Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.6027258Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6027472Z %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6027656Z %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6027857Z %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6028064Z tt.store %116, %109 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.6028211Z %117 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:56:11.6028341Z %118 = arith.divsi %117, %c128_i32 : i32 2026-02-21T09:56:11.6028468Z %119 = arith.muli %118, %c2_i32 : i32 2026-02-21T09:56:11.6028596Z %120 = arith.subi %c128_i32, %119 : i32 2026-02-21T09:56:11.6028720Z %121 = arith.minsi %120, %c2_i32 : i32 2026-02-21T09:56:11.6028846Z %122 = arith.remsi %117, %c128_i32 : i32 2026-02-21T09:56:11.6028971Z %123 = arith.remsi %122, %121 : i32 2026-02-21T09:56:11.6029089Z %124 = arith.addi %119, %123 : i32 2026-02-21T09:56:11.6029211Z %125 = arith.divsi %122, %121 : i32 2026-02-21T09:56:11.6029332Z %126 = arith.muli %124, %c128_i32 : i32 
2026-02-21T09:56:11.6029511Z %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.6029731Z %128 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.6029958Z %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.6030178Z %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.6030350Z %131 = arith.muli %125, %c128_i32 : i32 2026-02-21T09:56:11.6030525Z %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.6030743Z %133 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.6030965Z %134 = arith.addi %132, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.6031203Z %135 = arith.addi %133, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.6031478Z %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.6031755Z %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.6031955Z %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6032246Z %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.6032534Z %140 = tt.broadcast %139 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6032759Z %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.6032952Z %142 = arith.addi %138, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6033153Z %143 = tt.addptr %9, %142 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6033365Z %144 
= tt.load %143 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6033656Z %145 = ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6034040Z ttg.local_store %144, %145 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6034287Z %146 = arith.addi %138, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6034489Z %147 = tt.addptr %9, %146 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6034714Z %148 = tt.load %147 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6035006Z %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6035371Z ttg.local_store %148, %149 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6035903Z %150:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:11.6036390Z %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6036620Z %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6036804Z %201 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:11.6036930Z %202 = arith.muli %201, %c2_i32 : i32 2026-02-21T09:56:11.6037106Z %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.6037328Z %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.6037611Z %205 = tt.expand_dims %204 {axis = 0 : 
i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.6037896Z %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6038094Z %207 = arith.addi %138, %206 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6038303Z %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6038513Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6038823Z %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6039285Z %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6039670Z %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6039945Z %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6040146Z %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6040342Z %215 = arith.addi %214, %140 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6040549Z %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6040755Z %217 = tt.load %216 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6041007Z %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6041292Z %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6041534Z %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6041776Z %221 = arith.shrsi 
%218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6042088Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6042434Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6042800Z %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6043044Z %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6043295Z %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6043533Z %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6043774Z %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6044007Z %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6044266Z %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6044601Z %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6045084Z %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6045440Z %233 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:11.6045574Z %234 = arith.cmpi slt, %233, %c2_i32 : i32 2026-02-21T09:56:11.6045710Z %235 = arith.select %234, %233, %c0_i32 : i32 2026-02-21T09:56:11.6045985Z %236 = ttg.memdesc_index %141[%235] : !ttg.memdesc<2x128x4xbf16, 
#shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6046371Z ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6046778Z scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6047120Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:11.6047464Z %151 = ttg.local_load %150#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6047904Z %152 = arith.extf %151 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6048234Z %153 = arith.addi %66, %140 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6048441Z %154 = tt.addptr %10, %153 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6048644Z %155 = tt.load %154 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6048894Z %156 = ttg.convert_layout %155 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6049187Z %157 = arith.shli %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6049425Z %158 = arith.shrsi %157, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6049660Z %159 = arith.shrsi %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6049951Z %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6050308Z %161 = tt.expand_dims %159 {axis = 1 : i32} : 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6050594Z %162 = tt.broadcast %160 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6050833Z %163 = arith.select %15, %162, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6051087Z %164 = tt.broadcast %161 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6051319Z %165 = arith.select %17, %164, %163 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6051558Z %166 = tt.reshape %165 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6051786Z %167 = arith.sitofp %166 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6052040Z %168 = ttg.local_alloc %167 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6052369Z %169 = ttg.local_load %168 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6052841Z %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6053339Z %171 = ttg.local_load %150#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6053772Z %172 = arith.extf %171 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6054069Z %173 = arith.addi %90, %140 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6054269Z %174 = tt.addptr %10, %173 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 
2026-02-21T09:56:11.6054473Z %175 = tt.load %174 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6054715Z %176 = ttg.convert_layout %175 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6054996Z %177 = arith.shli %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6055230Z %178 = arith.shrsi %177, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6055488Z %179 = arith.shrsi %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6055776Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6056125Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6056412Z %182 = tt.broadcast %180 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6056649Z %183 = arith.select %15, %182, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6056887Z %184 = tt.broadcast %181 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6057123Z %185 = arith.select %17, %184, %183 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6057353Z %186 = tt.reshape %185 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6057583Z %187 = arith.sitofp %186 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6057834Z %188 = ttg.local_alloc %187 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6058163Z %189 = ttg.local_load %188 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6058650Z %190 = 
tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6059037Z ttg.local_dealloc %141 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.6059271Z %191 = arith.truncf %190 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.6059545Z %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.6059786Z %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.6060021Z %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.6060281Z %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6060492Z %196 = tt.broadcast %194 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6060676Z %197 = arith.addi %195, %196 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6060867Z %198 = tt.addptr %18, %197 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6061069Z tt.store %198, %191 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.6061201Z } 2026-02-21T09:56:11.6061298Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:56:11.6061434Z %23 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:56:11.6061559Z %24 = arith.muli %23, %c2_i32 : i32 2026-02-21T09:56:11.6061677Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:11.6061796Z %26 = arith.minsi %25, %c2_i32 : i32 2026-02-21T09:56:11.6061917Z %27 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:56:11.6062033Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:11.6062146Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:11.6062257Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:11.6062370Z %31 
= arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:11.6062535Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.6062750Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.6062966Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.6063195Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.6063358Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:11.6063519Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.6063746Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.6063956Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.6064165Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.6064436Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.6064686Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.6064879Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6065158Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.6065435Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6065655Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.6065934Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.6066203Z %48 = tt.broadcast %47 : tensor<1x4xi32, 
#blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6066392Z %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6066600Z %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6066801Z %51 = tt.load %50 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6067083Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6067439Z ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6067713Z %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.6067986Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.6068256Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6068444Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6068640Z %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6068838Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6069117Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6069475Z ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6069994Z %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 
2026-02-21T09:56:11.6070467Z %117 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6070700Z %118 = arith.addi %117, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6070875Z %119 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:11.6071022Z %120 = arith.muli %119, %c2_i32 : i32 2026-02-21T09:56:11.6071190Z %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.6071413Z %122 = arith.addi %121, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.6071706Z %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.6071981Z %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6072177Z %125 = arith.addi %43, %124 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6072377Z %126 = tt.addptr %9, %125 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.6072588Z %127 = tt.load %126 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.6072892Z %128 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6073331Z %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6073718Z %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6073981Z %131 = arith.muli %130, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6074177Z %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6074376Z %133 = arith.addi %132, %45 : 
tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6074592Z %134 = tt.addptr %10, %133 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6074796Z %135 = tt.load %134 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6075044Z %136 = ttg.convert_layout %135 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6075327Z %137 = arith.shli %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6075566Z %138 = arith.shrsi %137, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6075802Z %139 = arith.shrsi %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6076093Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6076432Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6076722Z %142 = tt.broadcast %140 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6076971Z %143 = arith.select %15, %142, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6077212Z %144 = tt.broadcast %141 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6077452Z %145 = arith.select %17, %144, %143 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6077682Z %146 = tt.reshape %145 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6077909Z %147 = arith.sitofp %146 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6078164Z %148 = ttg.local_alloc %147 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6078493Z %149 = ttg.local_load %148 : 
!ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6078988Z %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6079340Z %151 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:11.6079484Z %152 = arith.cmpi slt, %151, %c2_i32 : i32 2026-02-21T09:56:11.6079619Z %153 = arith.select %152, %151, %c0_i32 : i32 2026-02-21T09:56:11.6079885Z %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6080244Z ttg.local_store %127, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6080646Z scf.yield %150, %153, %arg8, %154 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.6080987Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:11.6081201Z %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6081524Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6081965Z %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6082343Z %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6082661Z %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1> 
2026-02-21T09:56:11.6082852Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6083044Z %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6083231Z %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6083430Z %69 = tt.load %68 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6083666Z %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6083944Z %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6084172Z %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6084404Z %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6084687Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6085014Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6085295Z %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6085531Z %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6085761Z %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6085989Z %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6086210Z %80 = tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6086430Z %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6086680Z %82 = ttg.local_alloc %81 : 
(tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6087028Z %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6087493Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6087899Z %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.6088225Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6088652Z %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6089025Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6089269Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.6089460Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6089647Z %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6089839Z %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.6090052Z %93 = tt.load %92 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.6090290Z %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6090580Z %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6090809Z %96 = arith.shrsi %95, %cst_9 : 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6091040Z %97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.6091320Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6091652Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.6091933Z %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6092171Z %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6092407Z %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6092640Z %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.6092872Z %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.6093096Z %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.6093347Z %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.6093676Z %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.6094141Z %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.6094522Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.6094737Z %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, 
#mma> 2026-02-21T09:56:11.6095027Z %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.6095265Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.6095511Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.6095772Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6095981Z %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6096161Z %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6096354Z %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.6096553Z tt.store %116, %109 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.6096695Z } {tt.num_stages = 1 : i32} 2026-02-21T09:56:11.6096799Z tt.return 2026-02-21T09:56:11.6096880Z } 2026-02-21T09:56:11.6096958Z } 2026-02-21T09:56:11.6097001Z 2026-02-21T09:56:11.6097033Z {-# 2026-02-21T09:56:11.6097116Z external_resources: { 2026-02-21T09:56:11.6097216Z mlir_reproducer: { 2026-02-21T09:56:11.6098244Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:56:11.6099239Z disable_threading: false, 2026-02-21T09:56:11.6099346Z verify_each: true 2026-02-21T09:56:11.6099437Z } 
2026-02-21T09:56:11.6099507Z } 2026-02-21T09:56:11.6099581Z #-} 2026-02-21T09:56:11.6099856Z /tmp/torchinductor_root/gb/cgbsjwqgofmihm2crsdbzbb7mfznt7moa5ficlcmjnnpxx5kut7y.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:56:11.6100547Z /tmp/torchinductor_root/gb/cgbsjwqgofmihm2crsdbzbb7mfznt7moa5ficlcmjnnpxx5kut7y.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:56:11.6101097Z [702s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:56:11.6101872Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:56:11.6102573Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:56:11.6102741Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:56:11.7584434Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed. 
2026-02-21T09:56:11.7591464Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T09:56:11.7592004Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:56:11.7592485Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T09:56:11.7593167Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T09:56:11.7593600Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}> 2026-02-21T09:56:11.7594054Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> 2026-02-21T09:56:11.7594417Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:56:11.7594701Z #smem = #ttg.shared_memory 2026-02-21T09:56:11.7595071Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:56:11.7595800Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:56:11.7607252Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.7607456Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.7607646Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.7607855Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7608040Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:56:11.7608307Z %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7608595Z 
%cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7608867Z %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7609157Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7609356Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.7609525Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:56:11.7609651Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:56:11.7609780Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:56:11.7609906Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:56:11.7610075Z %cst_8 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7610241Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:56:11.7610364Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:56:11.7610494Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:56:11.7610691Z %cst_9 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7610898Z %0 = tt.get_program_id x : i32 2026-02-21T09:56:11.7611023Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:56:11.7611154Z %2 = arith.minsi %1, %c8192_i32 : i32 2026-02-21T09:56:11.7611379Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.7611692Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.7611999Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.7612306Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.7612609Z %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7612909Z %8 = 
tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7613178Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7613408Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7613780Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:56:11.7614242Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:56:11.7614711Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.7615000Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.7615216Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:56:11.7615435Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:11.7615627Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked> 2026-02-21T09:56:11.7615835Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.7615992Z %19 = arith.subi %2, %0 : i32 2026-02-21T09:56:11.7616100Z %20 = arith.remsi %19, %c2_i32 : i32 2026-02-21T09:56:11.7616215Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:56:11.7616323Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:56:11.7616445Z scf.for %arg3 = %0 to %22 step %c2_i32 : i32 { 2026-02-21T09:56:11.7616580Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:11.7616701Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:11.7616831Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:11.7616947Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:11.7617065Z %27 = arith.remsi %arg3, 
%c256_i32 : i32 2026-02-21T09:56:11.7617180Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:11.7617309Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:11.7617416Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:11.7617529Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:11.7617697Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.7617911Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.7618124Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.7618331Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.7618495Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:11.7618655Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.7618864Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.7619070Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.7619280Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.7619549Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.7619796Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.7619989Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7620269Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.7620545Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7620764Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 
2026-02-21T09:56:11.7621028Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.7621323Z %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7621510Z %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7621706Z %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7621932Z %51 = tt.load %50 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7622213Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7622571Z ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7622841Z %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7623114Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.7623386Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7623573Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7623770Z %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7623969Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7624277Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7624632Z ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7625181Z %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = 
%cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:11.7625677Z %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7625905Z %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7626079Z %201 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:11.7626204Z %202 = arith.muli %201, %c2_i32 : i32 2026-02-21T09:56:11.7626373Z %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7626593Z %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7626868Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.7627143Z %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7627339Z %207 = arith.addi %43, %206 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7627539Z %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7627746Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7628050Z %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7628492Z %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7628875Z %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7629123Z %213 = arith.muli %212, %cst_6 : 
tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7629345Z %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7629537Z %215 = arith.addi %214, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7629737Z %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7629957Z %217 = tt.load %216 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7630199Z %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7630485Z %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7630723Z %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7630959Z %221 = arith.shrsi %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7631250Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7631586Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7631872Z %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7632116Z %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7632377Z %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7632615Z %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7632861Z %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7633089Z %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, 
#blocked3> 2026-02-21T09:56:11.7633347Z %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7633673Z %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7634161Z %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7634516Z %233 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:11.7634643Z %234 = arith.cmpi slt, %233, %c2_i32 : i32 2026-02-21T09:56:11.7634778Z %235 = arith.select %234, %233, %c0_i32 : i32 2026-02-21T09:56:11.7635050Z %236 = ttg.memdesc_index %46[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7635414Z ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7635819Z scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7636202Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:11.7636462Z %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7636789Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7637219Z %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 
2026-02-21T09:56:11.7637621Z %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7637861Z %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7638180Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7638367Z %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7638557Z %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7638755Z %69 = tt.load %68 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7638990Z %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7639263Z %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7639491Z %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7639719Z %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7640002Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7640347Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7640624Z %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7640857Z %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7641103Z %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7641331Z %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7641553Z %80 = 
tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7641772Z %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7642016Z %82 = ttg.local_alloc %81 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7642336Z %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7642859Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7643244Z %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7643573Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7644007Z %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7644381Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7644627Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7644815Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7645005Z %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7645198Z %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7645423Z %93 = tt.load %92 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7645660Z %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> 
tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7645931Z %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7646177Z %96 = arith.shrsi %95, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7646405Z %97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7646683Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7647019Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7647296Z %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7647542Z %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7647774Z %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7648009Z %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7648242Z %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7648478Z %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7648730Z %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7649071Z %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7649536Z %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> 
tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7649916Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.7650129Z %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.7650400Z %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.7650636Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.7650863Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.7651123Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7651327Z %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7651511Z %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7651705Z %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7651907Z tt.store %116, %109 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.7652050Z %117 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:56:11.7652171Z %118 = arith.divsi %117, %c256_i32 : i32 2026-02-21T09:56:11.7652294Z %119 = arith.muli %118, %c4_i32 : i32 2026-02-21T09:56:11.7652413Z %120 = arith.subi %c128_i32, %119 : i32 2026-02-21T09:56:11.7652532Z %121 = arith.minsi %120, %c4_i32 : i32 2026-02-21T09:56:11.7652650Z %122 = arith.remsi %117, %c256_i32 : i32 2026-02-21T09:56:11.7652770Z %123 = arith.remsi %122, %121 : i32 2026-02-21T09:56:11.7652885Z %124 = arith.addi %119, %123 : i32 2026-02-21T09:56:11.7652996Z %125 = arith.divsi %122, %121 : i32 2026-02-21T09:56:11.7677557Z %126 = arith.muli %124, %c128_i32 : i32 2026-02-21T09:56:11.7677727Z %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.7677947Z %128 = tt.splat %126 : i32 -> 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.7678191Z %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.7678409Z %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.7678578Z %131 = arith.muli %125, %c128_i32 : i32 2026-02-21T09:56:11.7678744Z %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.7678961Z %133 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.7679174Z %134 = arith.addi %132, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.7679389Z %135 = arith.addi %133, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.7679664Z %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.7679922Z %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.7680122Z %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7680425Z %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.7680706Z %140 = tt.broadcast %139 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7680944Z %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.7681128Z %142 = arith.addi %138, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7681330Z %143 = tt.addptr %9, %142 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7681533Z %144 = tt.load %143 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7681822Z %145 = ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> 
!ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7682187Z ttg.local_store %144, %145 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7682428Z %146 = arith.addi %138, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7682674Z %147 = tt.addptr %9, %146 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7682878Z %148 = tt.load %147 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7683162Z %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7683521Z ttg.local_store %148, %149 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7684049Z %150:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:11.7684527Z %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7684759Z %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7684934Z %201 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:11.7685058Z %202 = arith.muli %201, %c2_i32 : i32 2026-02-21T09:56:11.7685225Z %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7685471Z %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7685743Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.7686038Z %206 = tt.broadcast %205 : tensor<1x4xi32, 
#blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7686233Z %207 = arith.addi %138, %206 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7686435Z %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7686644Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7686945Z %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7687386Z %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7687768Z %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7688015Z %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7688207Z %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7688420Z %215 = arith.addi %214, %140 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7688621Z %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7688825Z %217 = tt.load %216 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7689094Z %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7689378Z %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7689612Z %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7689851Z %221 = arith.shrsi %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7690142Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, 
#ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7690478Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7690763Z %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7691009Z %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7691247Z %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7691486Z %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7691714Z %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7691940Z %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7692196Z %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7692521Z %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7692997Z %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7693365Z %233 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:11.7693495Z %234 = arith.cmpi slt, %233, %c2_i32 : i32 2026-02-21T09:56:11.7693629Z %235 = arith.select %234, %233, %c0_i32 : i32 2026-02-21T09:56:11.7693913Z %236 = ttg.memdesc_index %141[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7694275Z ttg.local_store %209, %236 : tensor<128x4xbf16, 
#blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7694671Z scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7695057Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:11.7695425Z %151 = ttg.local_load %150#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7695858Z %152 = arith.extf %151 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7696158Z %153 = arith.addi %66, %140 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7696281Z %154 = tt.addptr %10, %153 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7696345Z %155 = tt.load %154 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7696489Z %156 = ttg.convert_layout %155 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7696600Z %157 = arith.shli %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7696704Z %158 = arith.shrsi %157, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7696799Z %159 = arith.shrsi %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7696950Z %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7697101Z %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7697197Z %162 = 
tt.broadcast %160 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7697299Z %163 = arith.select %15, %162, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7697396Z %164 = tt.broadcast %161 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7697494Z %165 = arith.select %17, %164, %163 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7697586Z %166 = tt.reshape %165 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7697682Z %167 = arith.sitofp %166 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7697800Z %168 = ttg.local_alloc %167 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7697969Z %169 = ttg.local_load %168 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7698236Z %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7698431Z %171 = ttg.local_load %150#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7698639Z %172 = arith.extf %171 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7698716Z %173 = arith.addi %90, %140 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7698816Z %174 = tt.addptr %10, %173 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7698876Z %175 = tt.load %174 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7699021Z %176 = ttg.convert_layout %175 : 
tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7699118Z %177 = arith.shli %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7699216Z %178 = arith.shrsi %177, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7699314Z %179 = arith.shrsi %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7699462Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7699607Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7699703Z %182 = tt.broadcast %180 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7699822Z %183 = arith.select %15, %182, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7699914Z %184 = tt.broadcast %181 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7700029Z %185 = arith.select %17, %184, %183 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7700119Z %186 = tt.reshape %185 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7700212Z %187 = arith.sitofp %186 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7700331Z %188 = ttg.local_alloc %187 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7700499Z %189 = ttg.local_load %188 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7700760Z %190 = tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, 
#ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7700848Z ttg.local_dealloc %141 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.7700938Z %191 = arith.truncf %190 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.7701077Z %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.7701138Z %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.7701276Z %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.7701360Z %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7701443Z %196 = tt.broadcast %194 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7701502Z %197 = arith.addi %195, %196 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7701599Z %198 = tt.addptr %18, %197 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7701665Z tt.store %198, %191 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.7701709Z } {tt.num_stages = 1 : i32} 2026-02-21T09:56:11.7701776Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:56:11.7701821Z %23 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:56:11.7701866Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:11.7701907Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:11.7701947Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:11.7702041Z %27 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:56:11.7702081Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:11.7702121Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:11.7702160Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:11.7702203Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:11.7702293Z %32 = tt.splat %31 : i32 -> tensor<128xi32, 
#ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.7702377Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.7702469Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:11.7702551Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:11.7702592Z %36 = arith.muli %30, %c128_i32 : i32 2026-02-21T09:56:11.7702683Z %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.7702763Z %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.7702849Z %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:11.7702945Z %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:11.7703090Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.7703171Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:11.7703268Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7703412Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1> 2026-02-21T09:56:11.7703501Z %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7703589Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.7703728Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.7703814Z %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7703875Z %49 = arith.addi %43, %48 : 
tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7703975Z %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7704035Z %51 = tt.load %50 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7704218Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7704359Z ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7704450Z %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7704591Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.7704680Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7704739Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7704839Z %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7704915Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7705093Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7705229Z ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7705584Z %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:11.7705682Z %117 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = 
#blocked1}>> 2026-02-21T09:56:11.7705774Z %118 = arith.addi %117, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7705820Z %119 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:11.7705863Z %120 = arith.muli %119, %c2_i32 : i32 2026-02-21T09:56:11.7705954Z %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7706045Z %122 = arith.addi %121, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:11.7706191Z %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:11.7706303Z %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7706368Z %125 = arith.addi %43, %124 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7706469Z %126 = tt.addptr %9, %125 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:11.7706546Z %127 = tt.load %126 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:11.7706748Z %128 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7706945Z %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7707090Z %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7707157Z %131 = arith.muli %130, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7707248Z %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7707307Z %133 = arith.addi %132, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7707411Z %134 = tt.addptr %10, %133 : tensor<2x128x!tt.ptr, 
#blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7707473Z %135 = tt.load %134 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7707619Z %136 = ttg.convert_layout %135 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7707718Z %137 = arith.shli %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7707817Z %138 = arith.shrsi %137, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7707914Z %139 = arith.shrsi %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7708063Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7708209Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7708305Z %142 = tt.broadcast %140 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7708426Z %143 = arith.select %15, %142, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7708520Z %144 = tt.broadcast %141 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7708633Z %145 = arith.select %17, %144, %143 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7708725Z %146 = tt.reshape %145 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7708819Z %147 = arith.sitofp %146 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7708937Z %148 = ttg.local_alloc %147 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7709111Z %149 = ttg.local_load %148 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:56:11.7709376Z %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7709423Z %151 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:11.7709473Z %152 = arith.cmpi slt, %151, %c2_i32 : i32 2026-02-21T09:56:11.7709523Z %153 = arith.select %152, %151, %c0_i32 : i32 2026-02-21T09:56:11.7709715Z %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7709858Z ttg.local_store %127, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7710091Z scf.yield %150, %153, %arg8, %154 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:11.7710214Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:11.7710308Z %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7710506Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7710701Z %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7710846Z %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7710907Z %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7710995Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> 
tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7711057Z %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7711153Z %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7711211Z %69 = tt.load %68 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7711355Z %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7711449Z %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7711545Z %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7711639Z %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7711786Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7711943Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7712037Z %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7712149Z %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7712240Z %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7712336Z %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7712422Z %80 = tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7712511Z %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7712625Z %82 = ttg.local_alloc %81 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 
2026-02-21T09:56:11.7712796Z %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7713052Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7713158Z %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:11.7713350Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7713555Z %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7713699Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7713759Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:11.7713846Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7713905Z %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7714003Z %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr, #blocked1>, tensor<2x128xi32, #blocked1> 2026-02-21T09:56:11.7714059Z %93 = tt.load %92 : tensor<2x128x!tt.ptr, #blocked1> 2026-02-21T09:56:11.7714200Z %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7714294Z %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7714391Z %96 = arith.shrsi %95, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7714482Z 
%97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:11.7714630Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7714773Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked> 2026-02-21T09:56:11.7714867Z %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7714972Z %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7715063Z %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7715162Z %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked> 2026-02-21T09:56:11.7715269Z %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3> 2026-02-21T09:56:11.7715360Z %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3> 2026-02-21T09:56:11.7715499Z %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem> 2026-02-21T09:56:11.7715670Z %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:11.7715929Z %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma> 2026-02-21T09:56:11.7716015Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:11.7716108Z %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma> 2026-02-21T09:56:11.7716246Z %110 = tt.expand_dims %35 {axis = 1 : i32} : 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:11.7716304Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:11.7716442Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma> 2026-02-21T09:56:11.7716538Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7716619Z %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7716678Z %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7716787Z %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr, #mma>, tensor<128x128xi32, #mma> 2026-02-21T09:56:11.7716850Z tt.store %116, %109 : tensor<128x128x!tt.ptr, #mma> 2026-02-21T09:56:11.7716895Z } {tt.num_stages = 1 : i32} 2026-02-21T09:56:11.7716931Z tt.return 2026-02-21T09:56:11.7716962Z } 2026-02-21T09:56:11.7716996Z } 2026-02-21T09:56:11.7717003Z 2026-02-21T09:56:11.7717035Z {-# 2026-02-21T09:56:11.7717076Z external_resources: { 2026-02-21T09:56:11.7717114Z mlir_reproducer: { 2026-02-21T09:56:11.7718057Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:56:11.7718100Z disable_threading: false, 2026-02-21T09:56:11.7718139Z verify_each: true 2026-02-21T09:56:11.7718170Z } 2026-02-21T09:56:11.7718200Z } 2026-02-21T09:56:11.7718229Z #-} 
2026-02-21T09:56:11.7718469Z /tmp/torchinductor_root/ua/cuarvwdhzhdpgdiqi4kmk3qzuzohxvqdbtvdai46jfdu5h44yne7.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:11.7718888Z /tmp/torchinductor_root/ua/cuarvwdhzhdpgdiqi4kmk3qzuzohxvqdbtvdai46jfdu5h44yne7.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:11.7719001Z [702s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:11.7719627Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[1, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:11.7719717Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:11.7719801Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:11.9500620Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 10.2 configs/s
2026-02-21T09:56:15.6226814Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 214/214 33.0 configs/s
2026-02-21T09:56:18.2676565Z [709s] Generation 12 complete:
2026-02-21T09:56:18.2676944Z error=6
2026-02-21T09:56:18.2677153Z ok=39
2026-02-21T09:56:18.2677359Z min=0.8961
2026-02-21T09:56:18.2677590Z mid=1.1081
2026-02-21T09:56:18.2677792Z max=32.6472
2026-02-21T09:56:18.2678033Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:56:18.2678434Z 'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:56:18.2678794Z 'l2_groupings': [2],
2026-02-21T09:56:18.2679066Z 'load_eviction_policies': ['', ''],
2026-02-21T09:56:18.2679374Z 'loop_orders': [[0, 1]],
2026-02-21T09:56:18.2679656Z 'matrix_instr_nonkdim': 16,
2026-02-21T09:56:18.2679937Z 'num_stages': 1,
2026-02-21T09:56:18.2680170Z 'num_warps': 4,
2026-02-21T09:56:18.2680407Z 'pid_type': 'flat',
2026-02-21T09:56:18.2680665Z 'range_flattens': [None, None],
2026-02-21T09:56:18.2680975Z 'range_multi_buffers': [None, False],
2026-02-21T09:56:18.2681298Z 'range_num_stages': [0, 1],
2026-02-21T09:56:18.2681582Z 'range_unroll_factors': [0, 0],
2026-02-21T09:56:18.2681880Z 'range_warp_specializes': [],
2026-02-21T09:56:18.2682159Z 'waves_per_eu': 2}
2026-02-21T09:56:18.2762566Z [709s] Fitting surrogate: 1063 points, 1063 targets
2026-02-21T09:56:18.6038077Z [709s] Generation 13 starting: 21 neighbors, 1 active search path(s)
2026-02-21T09:56:23.5749151Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 4.1 configs/s
2026-02-21T09:56:25.3396425Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 12.0 configs/s
2026-02-21T09:56:25.6386785Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━ 223/223 270.2 configs/s
2026-02-21T09:56:28.7577544Z [719s] Generation 13 complete:
2026-02-21T09:56:28.7577765Z ok=23
2026-02-21T09:56:28.7577855Z min=0.9161
2026-02-21T09:56:28.7577946Z mid=1.3263
2026-02-21T09:56:28.7578027Z max=13.0921
2026-02-21T09:56:28.7578122Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:56:28.7578294Z 'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:56:28.7578437Z 'l2_groupings': [2],
2026-02-21T09:56:28.7578543Z 'load_eviction_policies': ['', ''],
2026-02-21T09:56:28.7578665Z 'loop_orders': [[0, 1]],
2026-02-21T09:56:28.7578773Z 'matrix_instr_nonkdim': 16,
2026-02-21T09:56:28.7578890Z 'num_stages': 1,
2026-02-21T09:56:28.7578980Z 'num_warps': 4,
2026-02-21T09:56:28.7579085Z 'pid_type': 'flat',
2026-02-21T09:56:28.7579209Z 'range_flattens': [None, None],
2026-02-21T09:56:28.7579330Z 'range_multi_buffers': [None, False],
2026-02-21T09:56:28.7579454Z 'range_num_stages': [0, 1],
2026-02-21T09:56:28.7579563Z 'range_unroll_factors': [0, 0],
2026-02-21T09:56:28.7579682Z 'range_warp_specializes': [],
2026-02-21T09:56:28.7580087Z 'waves_per_eu': 2}
2026-02-21T09:56:28.7629289Z [719s] Fitting surrogate: 1086 points, 1086 targets
2026-02-21T09:56:29.0900377Z [719s] Generation 14 starting: 22 neighbors, 1 active search path(s)
2026-02-21T09:56:34.7399453Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 2.5 configs/s
2026-02-21T09:56:36.6326768Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:36.6333512Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}> 2026-02-21T09:56:36.6334347Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T09:56:36.6334717Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T09:56:36.6335099Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}> 2026-02-21T09:56:36.6335533Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}> 2026-02-21T09:56:36.6335852Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:56:36.6336150Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}> 2026-02-21T09:56:36.6336377Z #smem = #ttg.shared_memory 2026-02-21T09:56:36.6336668Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T09:56:36.6337278Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:56:36.6337764Z %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma> 2026-02-21T09:56:36.6337990Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:36.6338205Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:36.6338435Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6338638Z %c508_i32 = arith.constant 508 : i32 2026-02-21T09:56:36.6338862Z %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6339280Z 
%cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6339583Z %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6339853Z %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6340080Z %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:36.6340269Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:56:36.6340413Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:56:36.6340558Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:56:36.6340709Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:56:36.6340853Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:56:36.6341046Z %cst_8 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6341224Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:56:36.6341368Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:56:36.6341589Z %cst_9 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6341822Z %0 = tt.get_program_id x : i32 2026-02-21T09:56:36.6341970Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:56:36.6342113Z %2 = arith.minsi %1, %c4096_i32 : i32 2026-02-21T09:56:36.6342375Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:36.6342799Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:36.6346349Z %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:36.6346645Z %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:36.6346927Z %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6347212Z %8 = 
tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6347501Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6347719Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6348014Z %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> 2026-02-21T09:56:36.6348477Z %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T09:56:36.6348920Z %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:36.6349195Z %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:36.6349409Z %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:56:36.6349620Z %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked> 2026-02-21T09:56:36.6349826Z %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked> 2026-02-21T09:56:36.6350047Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:56:36.6350221Z %19 = arith.subi %2, %0 : i32 2026-02-21T09:56:36.6350347Z %20 = arith.remsi %19, %c2_i32 : i32 2026-02-21T09:56:36.6350474Z %21 = arith.subi %19, %20 : i32 2026-02-21T09:56:36.6350597Z %22 = arith.addi %0, %21 : i32 2026-02-21T09:56:36.6350733Z scf.for %arg3 = %0 to %22 step %c2_i32 : i32 { 2026-02-21T09:56:36.6350885Z %23 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:56:36.6351016Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:36.6351151Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:36.6351302Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:36.6351437Z %27 = arith.remsi %arg3, 
%c128_i32 : i32 2026-02-21T09:56:36.6351570Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:36.6351692Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:36.6351813Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:36.6351933Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:36.6352121Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:36.6352350Z %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:36.6352586Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:36.6352813Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:36.6352976Z %36 = arith.muli %30, %c256_i32 : i32 2026-02-21T09:56:36.6353147Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:36.6353358Z %38 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:36.6353576Z %39 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:36.6353792Z %40 = arith.addi %38, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:36.6354067Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:36.6354326Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:36.6354618Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6354893Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:56:36.6355173Z %45 = tt.broadcast %44 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6355392Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 
2026-02-21T09:56:36.6355682Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:36.6355954Z %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6356142Z %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6356340Z %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6356561Z %51 = tt.load %50 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6356844Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6357202Z ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6357473Z %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6357747Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:36.6358013Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6358206Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6358402Z %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6358601Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6358879Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6359254Z ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6359782Z %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = 
%cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:36.6360256Z %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6360482Z %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6360661Z %201 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:36.6360787Z %202 = arith.muli %201, %c2_i32 : i32 2026-02-21T09:56:36.6360954Z %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6361178Z %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6361453Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:36.6361737Z %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6361932Z %207 = arith.addi %43, %206 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6362134Z %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6362367Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6362752Z %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6363202Z %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6363588Z %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6363861Z %213 = arith.muli %212, %cst_6 : 
tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6364057Z %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6364249Z %215 = arith.addi %214, %45 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6364453Z %216 = tt.addptr %10, %215 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6364694Z %217 = tt.load %216 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6364939Z %218 = ttg.convert_layout %217 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6365224Z %219 = arith.shli %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6365459Z %220 = arith.shrsi %219, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6365700Z %221 = arith.shrsi %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6365993Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6366330Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6366620Z %224 = tt.broadcast %222 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6366862Z %225 = arith.select %15, %224, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6367125Z %226 = tt.broadcast %223 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6367393Z %227 = arith.select %17, %226, %225 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6367627Z %228 = tt.reshape %227 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6367855Z %229 = arith.sitofp %228 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, 
#blocked3> 2026-02-21T09:56:36.6368108Z %230 = ttg.local_alloc %229 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6368442Z %231 = ttg.local_load %230 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6368924Z %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6369275Z %233 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:36.6369405Z %234 = arith.cmpi slt, %233, %c2_i32 : i32 2026-02-21T09:56:36.6369542Z %235 = arith.select %234, %233, %c0_i32 : i32 2026-02-21T09:56:36.6369809Z %236 = ttg.memdesc_index %46[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6370173Z ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6370574Z scf.yield %232, %235, %arg8, %236 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6370934Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:36.6371150Z %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6371474Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6371920Z %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6372297Z %64 = tt.expand_dims %61 
{axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6372537Z %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6372745Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6372933Z %67 = arith.addi %66, %45 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6373128Z %68 = tt.addptr %10, %67 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6373325Z %69 = tt.load %68 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6373564Z %70 = ttg.convert_layout %69 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6373843Z %71 = arith.shli %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6374074Z %72 = arith.shrsi %71, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6374305Z %73 = arith.shrsi %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6374589Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6374921Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6375215Z %76 = tt.broadcast %74 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6375450Z %77 = arith.select %15, %76, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6375685Z %78 = tt.broadcast %75 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6375911Z %79 = arith.select %17, %78, %77 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6376137Z %80 = tt.reshape %79 : tensor<2x2x256xi8, #blocked> -> 
tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6376357Z %81 = arith.sitofp %80 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6376605Z %82 = ttg.local_alloc %81 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6376926Z %83 = ttg.local_load %82 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6377395Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6377783Z %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6378113Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6378537Z %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6378927Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6379171Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6379362Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6379551Z %91 = arith.addi %90, %45 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6379760Z %92 = tt.addptr %10, %91 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6379956Z %93 = tt.load %92 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6380193Z %94 = ttg.convert_layout %93 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:56:36.6380470Z %95 = arith.shli %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6380714Z %96 = arith.shrsi %95, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6380940Z %97 = arith.shrsi %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6381226Z %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6381557Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6381839Z %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6382079Z %101 = arith.select %15, %100, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6382314Z %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6382547Z %103 = arith.select %17, %102, %101 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6382779Z %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6383003Z %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6394185Z %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6394528Z %107 = ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6394992Z %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 
2026-02-21T09:56:36.6395377Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:36.6395596Z %109 = arith.truncf %108 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:56:36.6395870Z %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:36.6396111Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:36.6396341Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:56:36.6396605Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6396812Z %114 = tt.broadcast %112 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6396996Z %115 = arith.addi %113, %114 : tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6397192Z %116 = tt.addptr %18, %115 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6397416Z tt.store %116, %109 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:56:36.6397563Z %117 = arith.addi %arg3, %c1_i32 : i32 2026-02-21T09:56:36.6397685Z %118 = arith.divsi %117, %c128_i32 : i32 2026-02-21T09:56:36.6397807Z %119 = arith.muli %118, %c4_i32 : i32 2026-02-21T09:56:36.6397926Z %120 = arith.subi %c128_i32, %119 : i32 2026-02-21T09:56:36.6398047Z %121 = arith.minsi %120, %c4_i32 : i32 2026-02-21T09:56:36.6398167Z %122 = arith.remsi %117, %c128_i32 : i32 2026-02-21T09:56:36.6398302Z %123 = arith.remsi %122, %121 : i32 2026-02-21T09:56:36.6398419Z %124 = arith.addi %119, %123 : i32 2026-02-21T09:56:36.6398532Z %125 = arith.divsi %122, %121 : i32 2026-02-21T09:56:36.6398651Z %126 = arith.muli %124, %c128_i32 : i32 2026-02-21T09:56:36.6398822Z %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:36.6399043Z %128 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, 
parent = #mma}>> 2026-02-21T09:56:36.6399289Z %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:36.6399502Z %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:36.6399671Z %131 = arith.muli %125, %c256_i32 : i32 2026-02-21T09:56:36.6399841Z %132 = tt.splat %131 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:36.6400057Z %133 = tt.splat %131 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:36.6400274Z %134 = arith.addi %132, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:36.6400488Z %135 = arith.addi %133, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:36.6400763Z %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:36.6401023Z %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:36.6401221Z %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6401504Z %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:56:36.6401801Z %140 = tt.broadcast %139 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6402025Z %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:36.6402209Z %142 = arith.addi %138, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6402410Z %143 = tt.addptr %9, %142 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6402678Z %144 = tt.load %143 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6402972Z %145 = ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, 
#smem, mutable, 2x128x4> 2026-02-21T09:56:36.6403336Z ttg.local_store %144, %145 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6403577Z %146 = arith.addi %138, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6403778Z %147 = tt.addptr %9, %146 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6403980Z %148 = tt.load %147 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6404264Z %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6404621Z ttg.local_store %148, %149 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6405149Z %150:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:36.6405646Z %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6405877Z %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6406051Z %201 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:36.6406200Z %202 = arith.muli %201, %c2_i32 : i32 2026-02-21T09:56:36.6406370Z %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6406590Z %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6406864Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:36.6407161Z %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 
2026-02-21T09:56:36.6407357Z %207 = arith.addi %138, %206 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6407559Z %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6407768Z %209 = tt.load %208 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6408070Z %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6408510Z %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6408894Z %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6409144Z %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6409340Z %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6409534Z %215 = arith.addi %214, %140 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6409752Z %216 = tt.addptr %10, %215 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6409958Z %217 = tt.load %216 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6410205Z %218 = ttg.convert_layout %217 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6410490Z %219 = arith.shli %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6410727Z %220 = arith.shrsi %219, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6410964Z %221 = arith.shrsi %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6411256Z %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> 
tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6411595Z %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6411882Z %224 = tt.broadcast %222 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6412129Z %225 = arith.select %15, %224, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6412371Z %226 = tt.broadcast %223 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6412609Z %227 = arith.select %17, %226, %225 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6412845Z %228 = tt.reshape %227 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6413092Z %229 = arith.sitofp %228 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6413348Z %230 = ttg.local_alloc %229 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6413676Z %231 = ttg.local_load %230 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6414176Z %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6414528Z %233 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:36.6414656Z %234 = arith.cmpi slt, %233, %c2_i32 : i32 2026-02-21T09:56:36.6414793Z %235 = arith.select %234, %233, %c0_i32 : i32 2026-02-21T09:56:36.6415078Z %236 = ttg.memdesc_index %141[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6415442Z ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, 
#smem, mutable, 2x128x4> 2026-02-21T09:56:36.6415845Z scf.yield %232, %235, %arg8, %236 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6416185Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:36.6416512Z %151 = ttg.local_load %150#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6416943Z %152 = arith.extf %151 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6417246Z %153 = arith.addi %66, %140 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6417448Z %154 = tt.addptr %10, %153 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6417649Z %155 = tt.load %154 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6417911Z %156 = ttg.convert_layout %155 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6418194Z %157 = arith.shli %156, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6418430Z %158 = arith.shrsi %157, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6418665Z %159 = arith.shrsi %156, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6418955Z %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6419297Z %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6419584Z %162 = tt.broadcast %160 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 
2026-02-21T09:56:36.6419825Z %163 = arith.select %15, %162, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6420067Z %164 = tt.broadcast %161 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6420297Z %165 = arith.select %17, %164, %163 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6420534Z %166 = tt.reshape %165 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6420760Z %167 = arith.sitofp %166 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6421030Z %168 = ttg.local_alloc %167 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6421357Z %169 = ttg.local_load %168 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6421831Z %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6422348Z %171 = ttg.local_load %150#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6422780Z %172 = arith.extf %171 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6423080Z %173 = arith.addi %90, %140 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6423296Z %174 = tt.addptr %10, %173 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6423500Z %175 = tt.load %174 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6423745Z %176 = ttg.convert_layout %175 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = 
#blocked}>> 2026-02-21T09:56:36.6431378Z %177 = arith.shli %176, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6431646Z %178 = arith.shrsi %177, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6431889Z %179 = arith.shrsi %176, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6432188Z %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6432536Z %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6432830Z %182 = tt.broadcast %180 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6433078Z %183 = arith.select %15, %182, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6433373Z %184 = tt.broadcast %181 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6433610Z %185 = arith.select %17, %184, %183 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6433841Z %186 = tt.reshape %185 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6434067Z %187 = arith.sitofp %186 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6434320Z %188 = ttg.local_alloc %187 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6434649Z %189 = ttg.local_load %188 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6435128Z %190 = tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 
2026-02-21T09:56:36.6435519Z ttg.local_dealloc %141 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:36.6435739Z %191 = arith.truncf %190 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:56:36.6436015Z %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:36.6436260Z %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma> 2026-02-21T09:56:36.6436495Z %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:56:36.6436774Z %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6436982Z %196 = tt.broadcast %194 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6437166Z %197 = arith.addi %195, %196 : tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6437360Z %198 = tt.addptr %18, %197 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6437580Z tt.store %198, %191 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:56:36.6437712Z } 2026-02-21T09:56:36.6437810Z scf.for %arg3 = %22 to %2 step %c1_i32 : i32 { 2026-02-21T09:56:36.6437946Z %23 = arith.divsi %arg3, %c128_i32 : i32 2026-02-21T09:56:36.6438072Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T09:56:36.6438191Z %25 = arith.subi %c128_i32, %24 : i32 2026-02-21T09:56:36.6438304Z %26 = arith.minsi %25, %c4_i32 : i32 2026-02-21T09:56:36.6438427Z %27 = arith.remsi %arg3, %c128_i32 : i32 2026-02-21T09:56:36.6438558Z %28 = arith.remsi %27, %26 : i32 2026-02-21T09:56:36.6438671Z %29 = arith.addi %24, %28 : i32 2026-02-21T09:56:36.6438779Z %30 = arith.divsi %27, %26 : i32 2026-02-21T09:56:36.6438893Z %31 = arith.muli %29, %c128_i32 : i32 2026-02-21T09:56:36.6439061Z %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:36.6439275Z %33 = tt.splat %31 : i32 -> tensor<128xi32, 
#ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:36.6439491Z %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> 2026-02-21T09:56:36.6439701Z %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> 2026-02-21T09:56:36.6439864Z %36 = arith.muli %30, %c256_i32 : i32 2026-02-21T09:56:36.6440024Z %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:36.6440239Z %38 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:36.6440449Z %39 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> 2026-02-21T09:56:36.6440657Z %40 = arith.addi %38, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> 2026-02-21T09:56:36.6440946Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2> 2026-02-21T09:56:36.6441199Z %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2> 2026-02-21T09:56:36.6441390Z %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6441671Z %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1> 2026-02-21T09:56:36.6441947Z %45 = tt.broadcast %44 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6442165Z %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:36.6442429Z %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:36.6442752Z %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6442946Z %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6443141Z %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr, #blocked2>, 
tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6443348Z %51 = tt.load %50 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6443631Z %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6443989Z ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6444280Z %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6444548Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:36.6444819Z %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6445004Z %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6445218Z %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6445419Z %58 = tt.load %57 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6445695Z %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6446050Z ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6446592Z %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>) : i32 { 2026-02-21T09:56:36.6447066Z %117 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6447299Z %118 = arith.addi %117, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 
2026-02-21T09:56:36.6447475Z %119 = arith.addi %arg4, %c4_i32 : i32 2026-02-21T09:56:36.6447599Z %120 = arith.muli %119, %c2_i32 : i32 2026-02-21T09:56:36.6447767Z %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6447989Z %122 = arith.addi %121, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> 2026-02-21T09:56:36.6448268Z %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2> 2026-02-21T09:56:36.6448543Z %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6448739Z %125 = arith.addi %43, %124 : tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6448961Z %126 = tt.addptr %9, %125 : tensor<128x4x!tt.ptr, #blocked2>, tensor<128x4xi32, #blocked2> 2026-02-21T09:56:36.6449172Z %127 = tt.load %126 : tensor<128x4x!tt.ptr, #blocked2> 2026-02-21T09:56:36.6449475Z %128 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6449911Z %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6450295Z %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6450549Z %131 = arith.muli %130, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6450744Z %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6450939Z %133 = arith.addi %132, %45 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6451139Z %134 = tt.addptr %10, %133 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6451344Z %135 = tt.load %134 : tensor<2x256x!tt.ptr, #blocked1> 
2026-02-21T09:56:36.6451591Z %136 = ttg.convert_layout %135 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6451872Z %137 = arith.shli %136, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6452125Z %138 = arith.shrsi %137, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6452361Z %139 = arith.shrsi %136, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6452654Z %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6452992Z %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6453297Z %142 = tt.broadcast %140 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6453542Z %143 = arith.select %15, %142, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6453781Z %144 = tt.broadcast %141 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6454036Z %145 = arith.select %17, %144, %143 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6454271Z %146 = tt.reshape %145 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6454495Z %147 = arith.sitofp %146 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6454752Z %148 = ttg.local_alloc %147 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6455082Z %149 = ttg.local_load %148 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6455561Z %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 
0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6455912Z %151 = arith.addi %arg6, %c1_i32 : i32 2026-02-21T09:56:36.6456041Z %152 = arith.cmpi slt, %151, %c2_i32 : i32 2026-02-21T09:56:36.6456177Z %153 = arith.select %152, %151, %c0_i32 : i32 2026-02-21T09:56:36.6456444Z %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6456826Z ttg.local_store %127, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6457230Z scf.yield %150, %153, %arg8, %154 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> 2026-02-21T09:56:36.6457570Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:56:36.6457785Z %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6458113Z %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6458541Z %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6458920Z %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6459162Z %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6459352Z %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6459543Z %67 = arith.addi %66, %45 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6459735Z %68 = tt.addptr %10, 
%67 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6459933Z %69 = tt.load %68 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6460184Z %70 = ttg.convert_layout %69 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6460461Z %71 = arith.shli %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6460691Z %72 = arith.shrsi %71, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6460919Z %73 = arith.shrsi %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6461225Z %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6461552Z %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6461832Z %76 = tt.broadcast %74 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6462085Z %77 = arith.select %15, %76, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6462315Z %78 = tt.broadcast %75 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6462544Z %79 = arith.select %17, %78, %77 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6462767Z %80 = tt.reshape %79 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6462987Z %81 = arith.sitofp %80 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6463239Z %82 = ttg.local_alloc %81 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6463557Z %83 = ttg.local_load %82 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 
2026-02-21T09:56:36.6464021Z %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6464408Z %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> 2026-02-21T09:56:36.6464750Z %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6465177Z %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6465552Z %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6465796Z %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1> 2026-02-21T09:56:36.6465987Z %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6466178Z %91 = arith.addi %90, %45 : tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6466373Z %92 = tt.addptr %10, %91 : tensor<2x256x!tt.ptr, #blocked1>, tensor<2x256xi32, #blocked1> 2026-02-21T09:56:36.6466566Z %93 = tt.load %92 : tensor<2x256x!tt.ptr, #blocked1> 2026-02-21T09:56:36.6466804Z %94 = ttg.convert_layout %93 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6467078Z %95 = arith.shli %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6467306Z %96 = arith.shrsi %95, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6467534Z %97 = arith.shrsi %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> 2026-02-21T09:56:36.6467811Z %98 = tt.expand_dims %96 {axis = 1 : i32} : 
tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6468156Z %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked> 2026-02-21T09:56:36.6468439Z %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6468679Z %101 = arith.select %15, %100, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6468916Z %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6469162Z %103 = arith.select %17, %102, %101 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked> 2026-02-21T09:56:36.6469395Z %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3> 2026-02-21T09:56:36.6469620Z %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3> 2026-02-21T09:56:36.6469872Z %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem> 2026-02-21T09:56:36.6470214Z %107 = ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> 2026-02-21T09:56:36.6470681Z %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma> 2026-02-21T09:56:36.6471064Z ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> 2026-02-21T09:56:36.6471282Z %109 = arith.truncf %108 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma> 2026-02-21T09:56:36.6471554Z %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma> 2026-02-21T09:56:36.6471794Z %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma> 
2026-02-21T09:56:36.6472028Z %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma> 2026-02-21T09:56:36.6472288Z %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6472496Z %114 = tt.broadcast %112 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6472690Z %115 = arith.addi %113, %114 : tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6472885Z %116 = tt.addptr %18, %115 : tensor<128x256x!tt.ptr, #mma>, tensor<128x256xi32, #mma> 2026-02-21T09:56:36.6473086Z tt.store %116, %109 : tensor<128x256x!tt.ptr, #mma> 2026-02-21T09:56:36.6473227Z } {tt.num_stages = 1 : i32} 2026-02-21T09:56:36.6473330Z tt.return 2026-02-21T09:56:36.6473410Z } 2026-02-21T09:56:36.6473487Z } 2026-02-21T09:56:36.6473531Z 2026-02-21T09:56:36.6473562Z {-# 2026-02-21T09:56:36.6473645Z external_resources: { 2026-02-21T09:56:36.6473743Z mlir_reproducer: { 2026-02-21T09:56:36.6474744Z pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})", 2026-02-21T09:56:36.6475731Z disable_threading: false, 2026-02-21T09:56:36.6475839Z verify_each: true 2026-02-21T09:56:36.6475928Z } 2026-02-21T09:56:36.6476000Z } 2026-02-21T09:56:36.6476068Z #-} 2026-02-21T09:56:36.6476349Z /tmp/torchinductor_root/3r/c3rtn5cq6dgpubecss5ujyjziuowuzxapwooes2cp2khsntpawqe.py:14:0: error: Failures have been detected while processing an MLIR pass 
pipeline 2026-02-21T09:56:36.6477057Z /tmp/torchinductor_root/3r/c3rtn5cq6dgpubecss5ujyjziuowuzxapwooes2cp2khsntpawqe.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:56:36.6477609Z [727s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:56:36.6478385Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:56:36.6479106Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:56:36.6479274Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:56:36.9305009Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 10.1 configs/s 2026-02-21T09:56:37.5833563Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━ 223/223 2027.1 2026-02-21T09:56:37.5833939Z configs/s 2026-02-21T09:56:40.4600339Z [731s] Generation 14 complete: 2026-02-21T09:56:40.4600736Z error=3 2026-02-21T09:56:40.4600953Z ok=21 2026-02-21T09:56:40.4601141Z min=0.9121 2026-02-21T09:56:40.4601330Z mid=1.2230 2026-02-21T09:56:40.4601509Z max=38.3666 2026-02-21T09:56:40.4601719Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:56:40.4602057Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:56:40.4602383Z 'l2_groupings': [2], 2026-02-21T09:56:40.4602719Z 'load_eviction_policies': ['', ''], 2026-02-21T09:56:40.4603030Z 'loop_orders': [[0, 1]], 2026-02-21T09:56:40.4603307Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:56:40.4603552Z 'num_stages': 1, 2026-02-21T09:56:40.4603769Z 'num_warps': 4, 2026-02-21T09:56:40.4603975Z 'pid_type': 'flat', 2026-02-21T09:56:40.4604214Z 'range_flattens': [None, None], 2026-02-21T09:56:40.4604485Z 'range_multi_buffers': [None, False], 2026-02-21T09:56:40.4604774Z 'range_num_stages': [0, 1], 2026-02-21T09:56:40.4605404Z 'range_unroll_factors': [0, 0], 2026-02-21T09:56:40.4605674Z 'range_warp_specializes': [], 2026-02-21T09:56:40.4605924Z 'waves_per_eu': 2} 2026-02-21T09:56:40.4644951Z [731s] Fitting surrogate: 1110 points, 1110 targets 2026-02-21T09:56:41.7829554Z [732s] Generation 15 starting: 23 neighbors, 1 active search path(s) 2026-02-21T09:56:46.2743455Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 4.8 configs/s 2026-02-21T09:56:49.2089557Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 8.7 configs/s 2026-02-21T09:56:49.5817978Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 223/223 38.9 configs/s 2026-02-21T09:56:52.1177904Z [742s] Generation 15 complete: 2026-02-21T09:56:52.1178384Z error=5 
2026-02-21T09:56:52.1178584Z ok=20 2026-02-21T09:56:52.1178789Z min=0.9190 2026-02-21T09:56:52.1178993Z mid=1.2972 2026-02-21T09:56:52.1179196Z max=35.8059 2026-02-21T09:56:52.1179433Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:56:52.1179823Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:56:52.1180190Z 'l2_groupings': [2], 2026-02-21T09:56:52.1180467Z 'load_eviction_policies': ['', ''], 2026-02-21T09:56:52.1180801Z 'loop_orders': [[0, 1]], 2026-02-21T09:56:52.1181086Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:56:52.1181285Z 'num_stages': 1, 2026-02-21T09:56:52.1181430Z 'num_warps': 4, 2026-02-21T09:56:52.1181580Z 'pid_type': 'flat', 2026-02-21T09:56:52.1181744Z 'range_flattens': [None, None], 2026-02-21T09:56:52.1181938Z 'range_multi_buffers': [None, False], 2026-02-21T09:56:52.1182159Z 'range_num_stages': [0, 1], 2026-02-21T09:56:52.1182337Z 'range_unroll_factors': [0, 0], 2026-02-21T09:56:52.1182944Z 'range_warp_specializes': [], 2026-02-21T09:56:52.1183126Z 'waves_per_eu': 2} 2026-02-21T09:56:52.1243529Z [742s] Fitting surrogate: 1135 points, 1135 targets 2026-02-21T09:56:52.5091517Z [743s] Generation 16 starting: 24 neighbors, 1 active search path(s) 2026-02-21T09:56:57.0986589Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 8.2 configs/s 2026-02-21T09:57:00.8350578Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 24/24 6.3 configs/s 2026-02-21T09:57:01.2005378Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 223/223 44.1 configs/s 2026-02-21T09:57:03.7989713Z [754s] Generation 16 complete: 2026-02-21T09:57:03.7990093Z ok=26 2026-02-21T09:57:03.7990279Z min=0.9103 2026-02-21T09:57:03.7990479Z mid=1.5053 2026-02-21T09:57:03.7990662Z max=56.1355 2026-02-21T09:57:03.7990877Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:57:03.7991206Z 'indexing': ['block_ptr', 'pointer', 'pointer'], 2026-02-21T09:57:03.7991580Z 'l2_groupings': [2], 2026-02-21T09:57:03.7991828Z 'load_eviction_policies': ['', 
''], 2026-02-21T09:57:03.7992613Z 'loop_orders': [[0, 1]], 2026-02-21T09:57:03.7992891Z 'matrix_instr_nonkdim': 16, 2026-02-21T09:57:03.7993138Z 'num_stages': 1, 2026-02-21T09:57:03.7993343Z 'num_warps': 4, 2026-02-21T09:57:03.7993551Z 'pid_type': 'flat', 2026-02-21T09:57:03.7993793Z 'range_flattens': [None, None], 2026-02-21T09:57:03.7994077Z 'range_multi_buffers': [None, False], 2026-02-21T09:57:03.7994364Z 'range_num_stages': [0, 1], 2026-02-21T09:57:03.7994633Z 'range_unroll_factors': [0, 0], 2026-02-21T09:57:03.7994895Z 'range_warp_specializes': [], 2026-02-21T09:57:03.7995145Z 'waves_per_eu': 2} 2026-02-21T09:57:03.8065936Z [754s] Fitting surrogate: 1161 points, 1161 targets 2026-02-21T09:57:03.9572012Z [754s] Autotuning complete in 754.8s after searching 1110 configs. 2026-02-21T09:57:03.9572246Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:57:03.9572958Z @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T09:57:03.9573906Z 2026-02-21T09:57:03.9574071Z [754s] Code of selected kernel: /tmp/torchinductor_root/ib/cibwt6vl2xsbx2eg55v3tglyzonuw3nw5eikfaoheyb5o3uggj2c.py 2026-02-21T09:57:03.9775980Z from __future__ import annotations 2026-02-21T09:57:03.9776088Z 2026-02-21T09:57:03.9776128Z import torch 2026-02-21T09:57:03.9776215Z import triton 2026-02-21T09:57:03.9776312Z import triton.language as tl 2026-02-21T09:57:03.9776461Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:57:03.9776586Z 2026-02-21T09:57:03.9776634Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T09:57:03.9776755Z _BLOCK_SIZE_2 = tl.constexpr(128) 
2026-02-21T09:57:03.9776877Z _BLOCK_SIZE_0 = tl.constexpr(16) 2026-02-21T09:57:03.9776950Z 2026-02-21T09:57:03.9776990Z @triton.jit 2026-02-21T09:57:03.9777141Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:57:03.9777369Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:57:03.9777533Z num_pid_m = tl.cdiv(16384, _BLOCK_SIZE_1) 2026-02-21T09:57:03.9777667Z num_pid_n = tl.cdiv(8192, _BLOCK_SIZE_2) 2026-02-21T09:57:03.9777800Z inner_2d_pid = tl.program_id(0) 2026-02-21T09:57:03.9777925Z num_pid_in_group = 2 * num_pid_n 2026-02-21T09:57:03.9778053Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:57:03.9778190Z first_pid_m = group_id * 2 2026-02-21T09:57:03.9778326Z group_size_m = min(num_pid_m - first_pid_m, 2) 2026-02-21T09:57:03.9778515Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:57:03.9778830Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T09:57:03.9778979Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T09:57:03.9779127Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:57:03.9779286Z offset_2 = pid_1 * _BLOCK_SIZE_2 2026-02-21T09:57:03.9779430Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:57:03.9779633Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:57:03.9779832Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:57:03.9780140Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:57:03.9780409Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:57:03.9780667Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:57:03.9780851Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:57:03.9781095Z for offset_3 in tl.range(0, 512, _BLOCK_SIZE_0, num_stages=1, disallow_acc_multi_buffer=True): 2026-02-21T09:57:03.9781323Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:57:03.9781475Z acc_copy = acc 2026-02-21T09:57:03.9781577Z acc_copy_0 = acc_copy 2026-02-21T09:57:03.9781725Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:57:03.9781872Z mul = 2 * offset_3 2026-02-21T09:57:03.9782044Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:57:03.9782228Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:57:03.9782393Z load = tl.load(A + (indices_1[:, None] * 1024 + iota[None, :] * 1), None) 2026-02-21T09:57:03.9782609Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:57:03.9782794Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:57:03.9782952Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:57:03.9783093Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:57:03.9783273Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:57:03.9783566Z b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), None) 2026-02-21T09:57:03.9783794Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:57:03.9783975Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:57:03.9784087Z v_2 = b_tile << v_1 2026-02-21T09:57:03.9784197Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:57:03.9784305Z v_4 = v_2 >> v_3 2026-02-21T09:57:03.9784460Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:57:03.9784628Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:57:03.9784740Z v_6 = b_tile >> v_5 2026-02-21T09:57:03.9784878Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:57:03.9785037Z stack_idx = 
tl.arange(0, 2) 2026-02-21T09:57:03.9785164Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:57:03.9785298Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:57:03.9785426Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:57:03.9785557Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:57:03.9785692Z mask_0 = broadcast_idx == 0 2026-02-21T09:57:03.9785838Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:57:03.9786000Z mask_1 = broadcast_idx == 1 2026-02-21T09:57:03.9786146Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:57:03.9786322Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:57:03.9786509Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:57:03.9786707Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:57:03.9786873Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:57:03.9787026Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:57:03.9787202Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:57:03.9787383Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:57:03.9787524Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:57:03.9787710Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:57:03.9787895Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:57:03.9788093Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:57:03.9788216Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:57:03.9788349Z acc = acc_copy_0 + sum_1 2026-02-21T09:57:03.9788495Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:57:03.9788672Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:57:03.9788838Z tl.store(C + (indices_1[:, None] * 8192 + indices_2[None, :] * 1), v_10, None) 2026-02-21T09:57:03.9788964Z 2026-02-21T09:57:03.9789052Z def matmul_bf16_int4(A: Tensor, B: 
Tensor, *, _launcher=_default_launcher): 2026-02-21T09:57:03.9789213Z """ 2026-02-21T09:57:03.9789327Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:57:03.9789433Z 2026-02-21T09:57:03.9789497Z This kernel performs matrix multiplication where: 2026-02-21T09:57:03.9789653Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:57:03.9789818Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:57:03.9789992Z (two 4-bit values packed into each int8) 2026-02-21T09:57:03.9790079Z 2026-02-21T09:57:03.9790114Z Args: 2026-02-21T09:57:03.9790234Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:57:03.9790414Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:57:03.9790534Z 2026-02-21T09:57:03.9790569Z Returns: 2026-02-21T09:57:03.9790693Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 2026-02-21T09:57:03.9790830Z """ 2026-02-21T09:57:03.9790923Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:57:03.9791037Z M, K = A.shape 2026-02-21T09:57:03.9791169Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:57:03.9791279Z _, N = B.shape 2026-02-21T09:57:03.9791429Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:57:03.9791630Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:57:03.9791811Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:57:03.9791960Z _BLOCK_SIZE_1 = 128 2026-02-21T09:57:03.9792057Z _BLOCK_SIZE_2 = 128 2026-02-21T09:57:03.9792226Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:57:03.9792491Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:57:03.9792748Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:57:03.9792930Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:57:03.9793040Z _BLOCK_SIZE_0 = 16 2026-02-21T09:57:03.9793167Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:57:03.9793346Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:57:03.9793517Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:57:03.9793644Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:57:03.9793789Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:57:03.9793978Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:57:03.9794149Z # src[int4_gemm.py:57-91]: ... 2026-02-21T09:57:03.9794316Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:57:03.9794686Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(16384, _BLOCK_SIZE_1) * triton.cdiv(8192, _BLOCK_SIZE_2),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=4, num_stages=1, waves_per_eu=2, matrix_instr_nonkdim=16) 2026-02-21T09:57:03.9795037Z # src[int4_gemm.py:93]: return C 2026-02-21T09:57:03.9795147Z return C 2026-02-21T09:57:04.8969725Z WARNING:tritonbench.utils.triton_op:Completed input ID 21: 2026-02-21T09:57:04.8970415Z x_val 2026-02-21T09:57:04.8970645Z --------------------- 2026-02-21T09:57:04.8970904Z (4, 4096, 8192, 1024) 2026-02-21T09:57:04.8971069Z 2026-02-21T09:57:04.8984761Z 70%|███████ | 7/10 [54:26<24:18, 486.07s/it]WARNING:tritonbench.utils.triton_op:Running input ID 24: 2026-02-21T09:57:04.8985208Z x_val 2026-02-21T09:57:04.8985398Z ---------------------- 2026-02-21T09:57:04.8985623Z (16, 4096, 1280, 8192) 2026-02-21T09:57:04.8987994Z INFO:tritonbench.utils.triton_op:Took 0.18ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:57:05.8923974Z INFO:tritonbench.utils.triton_op:Took 2.84ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:57:07.2664270Z INFO:tritonbench.utils.triton_op:Took 0.18ms to get benchmark function for preprocessed_triton_int4_gemm 
2026-02-21T09:57:07.2674312Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:57:07.2674745Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:57:07.2675055Z 'dtype': 'torch.bfloat16', 2026-02-21T09:57:07.2675336Z 'shape': (16, 4096, 8192), 2026-02-21T09:57:07.2675620Z 'stride': (33554432, 8192, 1)}, 2026-02-21T09:57:07.2675896Z { 'device': 'cuda:0', 2026-02-21T09:57:07.2676160Z 'dtype': 'torch.int32', 2026-02-21T09:57:07.2676429Z 'shape': (8192, 1280), 2026-02-21T09:57:07.2676681Z 'stride': (1280, 1)}), 2026-02-21T09:57:07.2676935Z 'kwargs': {}} 2026-02-21T09:57:07.2717462Z INFO:tritonbench.utils.triton_op:Took 4.49ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T09:57:07.4578995Z [0s] Autotune random seed: 2138032649 2026-02-21T09:57:07.5313238Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:57:46.5740749Z [39s] Timeout after 30s compiling Config(block_sizes=[128, 4096, 1], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[1, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:57:46.7600062Z [39s] Timeout after 30s compiling Config(block_sizes=[512, 4, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:57:47.6078925Z [40s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], 
indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:57:48.3950732Z [40s] Timeout after 30s compiling Config(block_sizes=[512, 4, 256], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:57:49.2557159Z [41s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 1], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T09:57:49.9929676Z [42s] Timeout after 30s compiling Config(block_sizes=[128, 512, 1], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[2, 0], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:57:50.5377436Z [43s] Timeout after 30s compiling Config(block_sizes=[256, 2, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], 
load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:57:51.4436006Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 32768, 4], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:57:51.4464738Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T09:58:29.7975757Z Initial population exploring neighbors 9% ━ 9/100 0.1 configs/s 2026-02-21T09:58:29.7979787Z WARNING:tritonbench.utils.triton_op:Completed input ID 24: 2026-02-21T09:58:29.7980734Z x_val 2026-02-21T09:58:29.7980952Z ---------------------- 2026-02-21T09:58:29.7981192Z (16, 4096, 1280, 8192) 2026-02-21T09:58:29.7981339Z 2026-02-21T09:58:29.7998633Z 80%|████████ | 8/10 [55:51<11:56, 358.35s/it]WARNING:tritonbench.utils.triton_op:Running input ID 28: 2026-02-21T09:58:29.7999022Z x_val 2026-02-21T09:58:29.7999197Z ---------------------- 2026-02-21T09:58:29.7999384Z (64, 4096, 1280, 8192) 2026-02-21T09:58:29.8040318Z INFO:tritonbench.utils.triton_op:Took 0.23ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:58:30.7955218Z INFO:tritonbench.utils.triton_op:Took 2.04ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:58:33.2696114Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T09:58:33.2705621Z WARNING:__main__:Input tensor metadata: 
2026-02-21T09:58:33.2706007Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:58:33.2706336Z 'dtype': 'torch.bfloat16', 2026-02-21T09:58:33.2706651Z 'shape': (64, 4096, 8192), 2026-02-21T09:58:33.2706967Z 'stride': (33554432, 8192, 1)}, 2026-02-21T09:58:33.2707295Z { 'device': 'cuda:0', 2026-02-21T09:58:33.2707583Z 'dtype': 'torch.int32', 2026-02-21T09:58:33.2707877Z 'shape': (8192, 1280), 2026-02-21T09:58:33.2708157Z 'stride': (1280, 1)}), 2026-02-21T09:58:33.2708434Z 'kwargs': {}} 2026-02-21T09:58:33.2723380Z INFO:tritonbench.utils.triton_op:Took 1.96ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T09:58:33.4552902Z [0s] Autotune random seed: 2138032649 2026-02-21T09:58:33.6719796Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:59:08.0436758Z [34s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 4], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[2, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:59:11.6909466Z [38s] Timeout after 30s compiling Config(block_sizes=[128, 4096, 1], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[1, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:59:12.1398018Z [38s] Timeout after 30s compiling Config(block_sizes=[512, 4, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], 
load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:59:12.8805255Z [39s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T09:59:13.5618988Z [39s] Timeout after 30s compiling Config(block_sizes=[512, 4, 256], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:59:14.2619290Z [40s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 1], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T09:59:14.9414949Z [41s] Timeout after 30s compiling Config(block_sizes=[128, 512, 1], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, 
num_sm_multiplier=32, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[2, 0], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:59:15.4541188Z [41s] Timeout after 30s compiling Config(block_sizes=[256, 2, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:59:16.2589306Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 32768, 4], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T09:59:16.2613844Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T10:06:23.7350822Z Initial population exploring neighbors 9% ━ 9/100 0.0 configs/s 2026-02-21T10:06:23.7355627Z WARNING:tritonbench.utils.triton_op:Completed input ID 28: 2026-02-21T10:06:23.7356049Z x_val 2026-02-21T10:06:23.7356289Z ---------------------- 2026-02-21T10:06:23.7356548Z (64, 4096, 1280, 8192) 2026-02-21T10:06:23.7356705Z 2026-02-21T10:06:23.7375744Z 90%|█████████ | 9/10 [1:03:44<06:34, 394.49s/it]WARNING:tritonbench.utils.triton_op:Running input ID 31: 2026-02-21T10:06:23.7376130Z x_val 2026-02-21T10:06:23.7376277Z ---------------------- 2026-02-21T10:06:23.7376792Z (64, 4096, 8192, 3584) 2026-02-21T10:06:23.7425009Z INFO:tritonbench.utils.triton_op:Took 0.35ms to get benchmark 
function for preprocessed_eager_int4_gemm 2026-02-21T10:06:24.7379083Z INFO:tritonbench.utils.triton_op:Took 2.11ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T10:06:26.9004879Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T10:06:26.9016782Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:06:26.9017030Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:06:26.9017245Z 'dtype': 'torch.bfloat16', 2026-02-21T10:06:26.9017460Z 'shape': (64, 4096, 3584), 2026-02-21T10:06:26.9017673Z 'stride': (14680064, 3584, 1)}, 2026-02-21T10:06:26.9017883Z { 'device': 'cuda:0', 2026-02-21T10:06:26.9018076Z 'dtype': 'torch.int32', 2026-02-21T10:06:26.9018284Z 'shape': (3584, 8192), 2026-02-21T10:06:26.9018480Z 'stride': (8192, 1)}), 2026-02-21T10:06:26.9018669Z 'kwargs': {}} 2026-02-21T10:06:26.9035969Z INFO:tritonbench.utils.triton_op:Took 2.15ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T10:06:27.0861738Z [0s] Autotune random seed: 2138032649 2026-02-21T10:06:27.6553364Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T10:07:02.2713267Z [34s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 4], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[2, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T10:07:04.6673111Z [37s] Timeout after 30s compiling Config(block_sizes=[256, 1024, 1], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, 
num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T10:07:04.8617632Z [37s] Timeout after 31s compiling Config(block_sizes=[512, 8, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T10:07:07.6965103Z [40s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T10:07:07.6991892Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T10:34:01.5247524Z Memory access fault by GPU node-2 (Agent handle: 0x5638f4c450c0) on address 0x7f273a000000. Reason: Unknown. 
2026-02-21T10:44:03.9163612Z /__w/_temp/1e980714-c608-4472-8874-5b583ec57a31.sh: line 10: 185332 Aborted (core dumped) HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py --op $kernel --metrics speedup,accuracy --latency-measure-mode triton_do_bench --cudagraph --only $IMPLS --only-match-mode prefix-with-baseline --baseline $BASELINE --atol 1e-2 --rtol 1e-2 --input-sample-mode equally-spaced-k --output "$TEST_REPORTS_DIR/helionbench.json" --append-to-output --keep-going 2026-02-21T10:44:03.9166071Z ✅ Completed benchmark for kernel: int4_gemm 2026-02-21T10:44:03.9166346Z ========================================== 2026-02-21T10:44:03.9178514Z Running benchmark for kernel: flash_attention 2026-02-21T10:44:03.9178715Z ========================================== 2026-02-21T10:44:15.8028931Z Using baseline: aten 2026-02-21T10:44:15.8029540Z Available implementations for flash_attention: flex_attention,helion_attention,triton_tutorial_flash_v2_tma_ws_persistent 2026-02-21T10:44:26.1863190Z Applying custom args for flash_attention: {'d_head': 128, 'num_inputs': 6} 2026-02-21T10:44:26.2135896Z INFO:root:TMA benchmarks will be running without grid constant TMA descriptor. 2026-02-21T10:44:26.2222978Z TMA benchmarks will be running without grid constant TMA descriptor. 2026-02-21T10:44:26.2349575Z Running flash_attention benchmark with Helion implementation... 
2026-02-21T10:44:26.2349864Z 2026-02-21T10:44:26.3861417Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 7) 2026-02-21T10:44:26.3861887Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 4, 5, 6] 2026-02-21T10:44:26.3866689Z 2026-02-21T10:44:26.3873440Z 0%| | 0/6 [00:00>) : (tensor<1x128x16xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x16xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>> 2026-02-21T10:44:55.0563230Z 2026-02-21T10:44:55.0564316Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T10:44:55.0565571Z ^ 2026-02-21T10:44:55.0566083Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T10:44:55.0566826Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.0567710Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.0568563Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.0569427Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.0569951Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.0570439Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.0570798Z #blocked6 = 
#ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:44:55.0571147Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:44:55.0571484Z #blocked8 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T10:44:55.0571818Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T10:44:55.0572191Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.0572555Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:44:55.0572920Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.0573294Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.0573676Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.0574048Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.0574485Z #blocked16 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:44:55.0574876Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T10:44:55.0575478Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : 
i32}) attributes {noinline = false} { 2026-02-21T10:44:55.0575973Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T10:44:55.0576122Z %c192_i64 = arith.constant 192 : i64 2026-02-21T10:44:55.0576270Z %c0_i64 = arith.constant 0 : i64 2026-02-21T10:44:55.0576415Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T10:44:55.0576610Z %cst = arith.constant dense<0.000000e+00> : tensor<1x128x16xbf16, #blocked> 2026-02-21T10:44:55.0576855Z %cst_0 = arith.constant dense<0> : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.0577090Z %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.0577306Z %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.0577520Z %cst_3 = arith.constant dense<128> : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.0577757Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x4x128xbf16, #blocked3> 2026-02-21T10:44:55.0577992Z %cst_5 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:55.0578202Z %cst_6 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:55.0578411Z %cst_7 = arith.constant dense<0> : tensor<1x4x1xi64, #blocked4> 2026-02-21T10:44:55.0578613Z %cst_8 = arith.constant dense<128> : tensor<1x4x1xi64, #blocked4> 2026-02-21T10:44:55.0578793Z %c16_i32 = arith.constant 16 : i32 2026-02-21T10:44:55.0578930Z %c128_i32 = arith.constant 128 : i32 2026-02-21T10:44:55.0579105Z %cst_9 = arith.constant dense<128> : tensor<1x4x1xi32, #blocked4> 2026-02-21T10:44:55.0579317Z %cst_10 = arith.constant dense<128> : tensor<1x16x1xi32, #blocked5> 2026-02-21T10:44:55.0579510Z %cst_11 = arith.constant dense<0.127517432> : tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0579711Z %cst_12 = arith.constant dense<0.127517432> : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0579932Z %cst_13 = arith.constant dense<0.000000e+00> : tensor<4x16xf32, #blocked7> 2026-02-21T10:44:55.0580103Z %c0_i32 = arith.constant 0 : i32 
2026-02-21T10:44:55.0580265Z %cst_14 = arith.constant dense<0.000000e+00> : tensor<1x4x128xf32, #blocked3> 2026-02-21T10:44:55.0580474Z %cst_15 = arith.constant dense<1.000000e+00> : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0580675Z %cst_16 = arith.constant dense<0xFF800000> : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0580836Z %c4_i32 = arith.constant 4 : i32 2026-02-21T10:44:55.0580958Z %c192_i32 = arith.constant 192 : i32 2026-02-21T10:44:55.0581080Z %0 = tt.get_program_id x : i32 2026-02-21T10:44:55.0581201Z %1 = arith.divsi %0, %c128_i32 : i32 2026-02-21T10:44:55.0581317Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T10:44:55.0581434Z %3 = arith.subi %c192_i32, %2 : i32 2026-02-21T10:44:55.0581554Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T10:44:55.0581672Z %5 = arith.remsi %0, %c128_i32 : i32 2026-02-21T10:44:55.0581790Z %6 = arith.remsi %5, %4 : i32 2026-02-21T10:44:55.0581902Z %7 = arith.addi %2, %6 : i32 2026-02-21T10:44:55.0582015Z %8 = arith.divsi %5, %4 : i32 2026-02-21T10:44:55.0582125Z %9 = arith.muli %8, %c4_i32 : i32 2026-02-21T10:44:55.0582283Z %10 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked8> 2026-02-21T10:44:55.0582462Z %11 = tt.splat %9 : i32 -> tensor<4xi32, #blocked8> 2026-02-21T10:44:55.0582616Z %12 = arith.addi %11, %10 : tensor<4xi32, #blocked8> 2026-02-21T10:44:55.0582794Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8> 2026-02-21T10:44:55.0582981Z %14 = arith.extsi %7 : i32 to i64 2026-02-21T10:44:55.0583097Z %15 = arith.extsi %9 : i32 to i64 2026-02-21T10:44:55.0583255Z %16 = tt.splat %arg0 : !tt.ptr -> tensor<1x4x128x!tt.ptr, #blocked3> 2026-02-21T10:44:55.0583432Z %17 = arith.muli %14, %c16384_i64 : i64 2026-02-21T10:44:55.0583575Z %18 = tt.splat %17 : i64 -> tensor<1x4x128xi64, #blocked3> 2026-02-21T10:44:55.0583734Z %19 = tt.splat %15 : i64 -> tensor<4xi64, #blocked8> 2026-02-21T10:44:55.0583930Z %20 = arith.extsi %10 : tensor<4xi32, #blocked8> to 
tensor<4xi64, #blocked8> 2026-02-21T10:44:55.0584106Z %21 = arith.addi %19, %20 : tensor<4xi64, #blocked8> 2026-02-21T10:44:55.0584342Z %22 = ttg.convert_layout %21 : tensor<4xi64, #blocked8> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T10:44:55.0584665Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x4xi64, #blocked9> 2026-02-21T10:44:55.0584971Z %24 = ttg.convert_layout %23 : tensor<1x4xi64, #blocked9> -> tensor<1x4xi64, #blocked6> 2026-02-21T10:44:55.0585255Z %25 = ttg.convert_layout %24 : tensor<1x4xi64, #blocked6> -> tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T10:44:55.0585588Z %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xi64, #blocked10> 2026-02-21T10:44:55.0585892Z %27 = ttg.convert_layout %26 : tensor<1x4x1xi64, #blocked10> -> tensor<1x4x1xi64, #blocked4> 2026-02-21T10:44:55.0586101Z %28 = arith.muli %27, %cst_8 : tensor<1x4x1xi64, #blocked4> 2026-02-21T10:44:55.0586304Z %29 = tt.broadcast %28 : tensor<1x4x1xi64, #blocked4> -> tensor<1x4x128xi64, #blocked4> 2026-02-21T10:44:55.0586553Z %30 = ttg.convert_layout %29 : tensor<1x4x128xi64, #blocked4> -> tensor<1x4x128xi64, #blocked3> 2026-02-21T10:44:55.0586788Z %31 = arith.extsi %13 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8> 2026-02-21T10:44:55.0587061Z %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T10:44:55.0587385Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9> 2026-02-21T10:44:55.0587699Z %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked11> 2026-02-21T10:44:55.0587992Z %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked11> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> 
2026-02-21T10:44:55.0588337Z %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x128xi64, #blocked12> 2026-02-21T10:44:55.0588647Z %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked12> -> tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:55.0588896Z %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked3> -> tensor<1x4x128xi64, #blocked3> 2026-02-21T10:44:55.0589104Z %39 = arith.addi %30, %38 : tensor<1x4x128xi64, #blocked3> 2026-02-21T10:44:55.0589267Z %40 = arith.addi %18, %39 : tensor<1x4x128xi64, #blocked3> 2026-02-21T10:44:55.0589473Z %41 = tt.addptr %16, %40 : tensor<1x4x128x!tt.ptr, #blocked3>, tensor<1x4x128xi64, #blocked3> 2026-02-21T10:44:55.0589677Z %42 = arith.cmpi sge, %14, %c0_i64 : i64 2026-02-21T10:44:55.0589803Z %43 = arith.cmpi slt, %14, %c192_i64 : i64 2026-02-21T10:44:55.0589934Z %44 = arith.andi %42, %43 : i1 2026-02-21T10:44:55.0590076Z %45 = arith.cmpi sge, %27, %cst_7 : tensor<1x4x1xi64, #blocked4> 2026-02-21T10:44:55.0590256Z %46 = arith.cmpi slt, %27, %cst_8 : tensor<1x4x1xi64, #blocked4> 2026-02-21T10:44:55.0590424Z %47 = arith.andi %45, %46 : tensor<1x4x1xi1, #blocked4> 2026-02-21T10:44:55.0590579Z %48 = tt.splat %44 : i1 -> tensor<1x4x1xi1, #blocked4> 2026-02-21T10:44:55.0590731Z %49 = arith.andi %48, %47 : tensor<1x4x1xi1, #blocked4> 2026-02-21T10:44:55.0590938Z %50 = tt.broadcast %49 : tensor<1x4x1xi1, #blocked4> -> tensor<1x4x128xi1, #blocked4> 2026-02-21T10:44:55.0591181Z %51 = ttg.convert_layout %50 : tensor<1x4x128xi1, #blocked4> -> tensor<1x4x128xi1, #blocked3> 2026-02-21T10:44:55.0591401Z %52 = arith.cmpi sge, %37, %cst_6 : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:55.0591577Z %53 = arith.cmpi slt, %37, %cst_5 : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:55.0591749Z %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked3> 2026-02-21T10:44:55.0591962Z %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked3> -> tensor<1x4x128xi1, #blocked3> 
2026-02-21T10:44:55.0592160Z %56 = arith.andi %51, %55 : tensor<1x4x128xi1, #blocked3> 2026-02-21T10:44:55.0592328Z %57 = tt.load %41, %56, %cst_4 : tensor<1x4x128x!tt.ptr, #blocked3> 2026-02-21T10:44:55.0592527Z %58 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked8> 2026-02-21T10:44:55.0592741Z %59 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x16x!tt.ptr, #blocked> 2026-02-21T10:44:55.0592942Z %60 = tt.splat %17 : i64 -> tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.0593198Z %61 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked11> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked13}>> 2026-02-21T10:44:55.0593547Z %62 = tt.expand_dims %61 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x128x1xi64, #blocked13> 2026-02-21T10:44:55.0593856Z %63 = ttg.convert_layout %62 : tensor<1x128x1xi64, #blocked13> -> tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.0594109Z %64 = tt.broadcast %63 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x16xi64, #blocked2> 2026-02-21T10:44:55.0594353Z %65 = ttg.convert_layout %64 : tensor<1x128x16xi64, #blocked2> -> tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.0594587Z %66 = arith.extsi %58 : tensor<16xi32, #blocked8> to tensor<16xi64, #blocked8> 2026-02-21T10:44:55.0594781Z %67 = arith.cmpi sge, %63, %cst_2 : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.0594959Z %68 = arith.cmpi slt, %63, %cst_1 : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.0595144Z %69 = arith.andi %67, %68 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:55.0595305Z %70 = tt.splat %44 : i1 -> tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:55.0595497Z %71 = arith.andi %70, %69 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:55.0595694Z %72 = tt.broadcast %71 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x16xi1, #blocked2> 2026-02-21T10:44:55.0595936Z %73 = ttg.convert_layout %72 : tensor<1x128x16xi1, #blocked2> -> tensor<1x128x16xi1, #blocked> 
2026-02-21T10:44:55.0596180Z %74 = tt.reshape %57 : tensor<1x4x128xbf16, #blocked3> -> tensor<4x128xbf16, #blocked11> 2026-02-21T10:44:55.0596366Z %75 = arith.muli %7, %c16384_i32 : i32 2026-02-21T10:44:55.0596505Z %76 = tt.splat %75 : i32 -> tensor<1x16x1xi32, #blocked5> 2026-02-21T10:44:55.0596751Z %77 = ttg.convert_layout %13 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T10:44:55.0597080Z %78 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9> 2026-02-21T10:44:55.0597377Z %79 = ttg.convert_layout %78 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked11> 2026-02-21T10:44:55.0597669Z %80 = ttg.convert_layout %79 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>> 2026-02-21T10:44:55.0598014Z %81 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x128xi32, #blocked12> 2026-02-21T10:44:55.0598322Z %82 = ttg.convert_layout %81 : tensor<1x1x128xi32, #blocked12> -> tensor<1x1x128xi32, #blocked3> 2026-02-21T10:44:55.0598570Z %83 = tt.broadcast %82 : tensor<1x1x128xi32, #blocked3> -> tensor<1x16x128xi32, #blocked3> 2026-02-21T10:44:55.0598814Z %84 = tt.splat %arg2 : !tt.ptr -> tensor<1x16x128x!tt.ptr, #blocked3> 2026-02-21T10:44:55.0599201Z %85:3 = scf.for %arg4 = %c0_i32 to %c128_i32 step %c16_i32 iter_args(%arg5 = %cst_16, %arg6 = %cst_15, %arg7 = %cst_14) -> (tensor<1x4xf32, #blocked6>, tensor<1x4xf32, #blocked6>, tensor<1x4x128xf32, #blocked3>) : i32 { 2026-02-21T10:44:55.0599554Z %115 = tt.splat %arg4 : i32 -> tensor<16xi32, #blocked8> 2026-02-21T10:44:55.0599717Z %116 = arith.addi %115, %58 : tensor<16xi32, #blocked8> 2026-02-21T10:44:55.0599876Z %117 = arith.extsi %arg4 : i32 to i64 2026-02-21T10:44:55.0600017Z %118 = tt.splat %117 : i64 -> tensor<16xi64, #blocked8> 2026-02-21T10:44:55.0600172Z %119 = arith.addi 
%118, %66 : tensor<16xi64, #blocked8> 2026-02-21T10:44:55.0600408Z %120 = ttg.convert_layout %119 : tensor<16xi64, #blocked8> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T10:44:55.0600743Z %121 = tt.expand_dims %120 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x16xi64, #blocked9> 2026-02-21T10:44:55.0601055Z %122 = ttg.convert_layout %121 : tensor<1x16xi64, #blocked9> -> tensor<1x16xi64, #blocked7> 2026-02-21T10:44:55.0601350Z %123 = ttg.convert_layout %122 : tensor<1x16xi64, #blocked7> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked14}>> 2026-02-21T10:44:55.0601704Z %124 = tt.expand_dims %123 {axis = 1 : i32} : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x16xi64, #blocked14> 2026-02-21T10:44:55.0602015Z %125 = ttg.convert_layout %124 : tensor<1x1x16xi64, #blocked14> -> tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.0602236Z %126 = arith.muli %125, %cst_3 : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.0602445Z %127 = tt.broadcast %126 : tensor<1x1x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked1> 2026-02-21T10:44:55.0602765Z %128 = ttg.convert_layout %127 : tensor<1x128x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.0602990Z %129 = arith.addi %65, %128 : tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.0603155Z %130 = arith.addi %60, %129 : tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.0603374Z %131 = tt.addptr %59, %130 : tensor<1x128x16x!tt.ptr, #blocked>, tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.0603625Z %132 = arith.cmpi sge, %125, %cst_0 : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.0603810Z %133 = arith.cmpi slt, %125, %cst_3 : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.0603987Z %134 = arith.andi %132, %133 : tensor<1x1x16xi1, #blocked1> 2026-02-21T10:44:55.0604196Z %135 = tt.broadcast %134 : tensor<1x1x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked1> 2026-02-21T10:44:55.0604450Z 
%136 = ttg.convert_layout %135 : tensor<1x128x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked> 2026-02-21T10:44:55.0604661Z %137 = arith.andi %73, %136 : tensor<1x128x16xi1, #blocked> 2026-02-21T10:44:55.0604842Z %138 = tt.load %131, %137, %cst : tensor<1x128x16x!tt.ptr, #blocked> 2026-02-21T10:44:55.0605069Z %139 = tt.reshape %138 : tensor<1x128x16xbf16, #blocked> -> tensor<128x16xbf16, #blocked7> 2026-02-21T10:44:55.0605377Z %140 = ttg.convert_layout %74 : tensor<4x128xbf16, #blocked11> -> tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked7}>> 2026-02-21T10:44:55.0605744Z %141 = ttg.convert_layout %139 : tensor<128x16xbf16, #blocked7> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked7}>> 2026-02-21T10:44:55.0606049Z %142 = ttg.convert_layout %cst_13 : tensor<4x16xf32, #blocked7> -> tensor<4x16xf32, #blocked7> 2026-02-21T10:44:55.0606467Z %143 = tt.dot %140, %141, %142, inputPrecision = tf32 : tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked7}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked7}>> -> tensor<4x16xf32, #blocked7> 2026-02-21T10:44:55.0606894Z %144 = tt.reshape %143 : tensor<4x16xf32, #blocked7> -> tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0607147Z %145 = arith.truncf %144 : tensor<1x4x16xf32, #blocked> to tensor<1x4x16xbf16, #blocked> 2026-02-21T10:44:55.0607387Z %146 = arith.extf %145 : tensor<1x4x16xbf16, #blocked> to tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0607581Z %147 = "tt.reduce"(%146) <{axis = 2 : i32}> ({ 2026-02-21T10:44:55.0607713Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:44:55.0607841Z %203 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T10:44:55.0607969Z tt.reduce.return %203 : f32 2026-02-21T10:44:55.0608182Z }) : (tensor<1x4x16xf32, #blocked>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T10:44:55.0608474Z %148 = ttg.convert_layout %147 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x4xf32, #blocked6> 
2026-02-21T10:44:55.0608750Z %149 = arith.truncf %148 : tensor<1x4xf32, #blocked6> to tensor<1x4xbf16, #blocked6> 2026-02-21T10:44:55.0608976Z %150 = arith.extf %149 : tensor<1x4xbf16, #blocked6> to tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0609190Z %151 = arith.mulf %150, %cst_12 : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0609387Z %152 = arith.truncf %151 : tensor<1x4xf32, #blocked6> to tensor<1x4xbf16, #blocked6> 2026-02-21T10:44:55.0609605Z %153 = arith.extf %152 : tensor<1x4xbf16, #blocked6> to tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0609809Z %154 = arith.cmpf ogt, %arg5, %153 : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0609987Z %155 = arith.cmpf une, %arg5, %arg5 : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0610154Z %156 = arith.ori %154, %155 : tensor<1x4xi1, #blocked6> 2026-02-21T10:44:55.0610355Z %157 = arith.select %156, %arg5, %153 : tensor<1x4xi1, #blocked6>, tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0610562Z %158 = arith.mulf %146, %cst_11 : tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0610768Z %159 = arith.truncf %158 : tensor<1x4x16xf32, #blocked> to tensor<1x4x16xbf16, #blocked> 2026-02-21T10:44:55.0611060Z %160 = ttg.convert_layout %157 : tensor<1x4xf32, #blocked6> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T10:44:55.0611404Z %161 = tt.expand_dims %160 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xf32, #blocked10> 2026-02-21T10:44:55.0611729Z %162 = ttg.convert_layout %161 : tensor<1x4x1xf32, #blocked10> -> tensor<1x4x1xf32, #blocked4> 2026-02-21T10:44:55.0611970Z %163 = arith.extf %159 : tensor<1x4x16xbf16, #blocked> to tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0612211Z %164 = tt.broadcast %162 : tensor<1x4x1xf32, #blocked4> -> tensor<1x4x16xf32, #blocked4> 2026-02-21T10:44:55.0612460Z %165 = ttg.convert_layout %164 : tensor<1x4x16xf32, #blocked4> -> tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0612673Z %166 = arith.subf 
%163, %165 : tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0612983Z %167 = tt.extern_elementwise %166 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4x16xf32, #blocked>) -> tensor<1x4x16xf32, #blocked> 2026-02-21T10:44:55.0613272Z %168 = "tt.reduce"(%167) <{axis = 2 : i32}> ({ 2026-02-21T10:44:55.0613406Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:44:55.0613530Z %203 = arith.addf %arg8, %arg9 : f32 2026-02-21T10:44:55.0613652Z tt.reduce.return %203 : f32 2026-02-21T10:44:55.0613846Z }) : (tensor<1x4x16xf32, #blocked>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T10:44:55.0614137Z %169 = ttg.convert_layout %168 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0614385Z %170 = arith.subf %arg5, %157 : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0614675Z %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4xf32, #blocked6>) -> tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0614982Z %172 = arith.mulf %arg6, %171 : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0615150Z %173 = arith.addf %172, %169 : tensor<1x4xf32, #blocked6> 2026-02-21T10:44:55.0615395Z %174 = ttg.convert_layout %171 : tensor<1x4xf32, #blocked6> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T10:44:55.0615747Z %175 = tt.expand_dims %174 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xf32, #blocked10> 2026-02-21T10:44:55.0616051Z %176 = ttg.convert_layout %175 : tensor<1x4x1xf32, #blocked10> -> tensor<1x4x1xf32, #blocked4> 2026-02-21T10:44:55.0616319Z %177 = tt.broadcast %176 : tensor<1x4x1xf32, #blocked4> -> tensor<1x4x128xf32, #blocked4> 2026-02-21T10:44:55.0616573Z %178 = ttg.convert_layout %177 : tensor<1x4x128xf32, #blocked4> -> tensor<1x4x128xf32, #blocked3> 2026-02-21T10:44:55.0616792Z %179 = arith.mulf %arg7, %178 : 
tensor<1x4x128xf32, #blocked3> 2026-02-21T10:44:55.0617059Z %180 = ttg.convert_layout %116 : tensor<16xi32, #blocked8> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T10:44:55.0617387Z %181 = tt.expand_dims %180 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x16xi32, #blocked9> 2026-02-21T10:44:55.0617682Z %182 = ttg.convert_layout %181 : tensor<1x16xi32, #blocked9> -> tensor<1x16xi32, #blocked7> 2026-02-21T10:44:55.0617977Z %183 = ttg.convert_layout %182 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> 2026-02-21T10:44:55.0618321Z %184 = tt.expand_dims %183 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x16x1xi32, #blocked15> 2026-02-21T10:44:55.0618630Z %185 = ttg.convert_layout %184 : tensor<1x16x1xi32, #blocked15> -> tensor<1x16x1xi32, #blocked5> 2026-02-21T10:44:55.0618852Z %186 = arith.muli %185, %cst_10 : tensor<1x16x1xi32, #blocked5> 2026-02-21T10:44:55.0619019Z %187 = arith.addi %76, %186 : tensor<1x16x1xi32, #blocked5> 2026-02-21T10:44:55.0619225Z %188 = tt.broadcast %187 : tensor<1x16x1xi32, #blocked5> -> tensor<1x16x128xi32, #blocked5> 2026-02-21T10:44:55.0619480Z %189 = ttg.convert_layout %188 : tensor<1x16x128xi32, #blocked5> -> tensor<1x16x128xi32, #blocked3> 2026-02-21T10:44:55.0619704Z %190 = arith.addi %189, %83 : tensor<1x16x128xi32, #blocked3> 2026-02-21T10:44:55.0619941Z %191 = tt.addptr %84, %190 : tensor<1x16x128x!tt.ptr, #blocked3>, tensor<1x16x128xi32, #blocked3> 2026-02-21T10:44:55.0620167Z %192 = tt.load %191 : tensor<1x16x128x!tt.ptr, #blocked3> 2026-02-21T10:44:55.0620373Z %193 = arith.truncf %167 : tensor<1x4x16xf32, #blocked> to tensor<1x4x16xbf16, #blocked> 2026-02-21T10:44:55.0620615Z %194 = tt.reshape %179 : tensor<1x4x128xf32, #blocked3> -> tensor<4x128xf32, #blocked11> 2026-02-21T10:44:55.0620857Z %195 = tt.reshape %193 : tensor<1x4x16xbf16, #blocked> -> tensor<4x16xbf16, 
#blocked7> 2026-02-21T10:44:55.0621100Z %196 = tt.reshape %192 : tensor<1x16x128xbf16, #blocked3> -> tensor<16x128xbf16, #blocked11> 2026-02-21T10:44:55.0621405Z %197 = ttg.convert_layout %195 : tensor<4x16xbf16, #blocked7> -> tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> 2026-02-21T10:44:55.0621768Z %198 = ttg.convert_layout %196 : tensor<16x128xbf16, #blocked11> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> 2026-02-21T10:44:55.0622079Z %199 = ttg.convert_layout %194 : tensor<4x128xf32, #blocked11> -> tensor<4x128xf32, #blocked16> 2026-02-21T10:44:55.0622508Z %200 = tt.dot %197, %198, %199, inputPrecision = tf32 : tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> -> tensor<4x128xf32, #blocked16> 2026-02-21T10:44:55.0622923Z %201 = ttg.convert_layout %200 : tensor<4x128xf32, #blocked16> -> tensor<4x128xf32, #blocked11> 2026-02-21T10:44:55.0623170Z %202 = tt.reshape %201 : tensor<4x128xf32, #blocked11> -> tensor<1x4x128xf32, #blocked3> 2026-02-21T10:44:55.0623466Z scf.yield %157, %173, %202 : tensor<1x4xf32, #blocked6>, tensor<1x4xf32, #blocked6>, tensor<1x4x128xf32, #blocked3> 2026-02-21T10:44:55.0623682Z } {tt.num_stages = 4 : i32} 2026-02-21T10:44:55.0623905Z %86 = ttg.convert_layout %85#1 : tensor<1x4xf32, #blocked6> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T10:44:55.0624247Z %87 = tt.expand_dims %86 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xf32, #blocked10> 2026-02-21T10:44:55.0624556Z %88 = ttg.convert_layout %87 : tensor<1x4x1xf32, #blocked10> -> tensor<1x4x1xf32, #blocked4> 2026-02-21T10:44:55.0624799Z %89 = tt.broadcast %88 : tensor<1x4x1xf32, #blocked4> -> tensor<1x4x128xf32, #blocked4> 2026-02-21T10:44:55.0625041Z %90 = ttg.convert_layout %89 : tensor<1x4x128xf32, #blocked4> -> tensor<1x4x128xf32, #blocked3> 2026-02-21T10:44:55.0625260Z %91 = 
arith.divf %85#2, %90 : tensor<1x4x128xf32, #blocked3> 2026-02-21T10:44:55.0625490Z %92 = arith.truncf %91 : tensor<1x4x128xf32, #blocked3> to tensor<1x4x128xbf16, #blocked3> 2026-02-21T10:44:55.0625676Z %93 = arith.muli %7, %c16384_i32 : i32 2026-02-21T10:44:55.0625894Z %94 = ttg.convert_layout %12 : tensor<4xi32, #blocked8> -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T10:44:55.0626214Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x4xi32, #blocked9> 2026-02-21T10:44:55.0626498Z %96 = ttg.convert_layout %95 : tensor<1x4xi32, #blocked9> -> tensor<1x4xi32, #blocked6> 2026-02-21T10:44:55.0626779Z %97 = ttg.convert_layout %96 : tensor<1x4xi32, #blocked6> -> tensor<1x4xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T10:44:55.0627108Z %98 = tt.expand_dims %97 {axis = 2 : i32} : tensor<1x4xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xi32, #blocked10> 2026-02-21T10:44:55.0627408Z %99 = ttg.convert_layout %98 : tensor<1x4x1xi32, #blocked10> -> tensor<1x4x1xi32, #blocked4> 2026-02-21T10:44:55.0627617Z %100 = arith.muli %99, %cst_9 : tensor<1x4x1xi32, #blocked4> 2026-02-21T10:44:55.0627785Z %101 = tt.splat %93 : i32 -> tensor<1x4x1xi32, #blocked4> 2026-02-21T10:44:55.0627949Z %102 = arith.addi %101, %100 : tensor<1x4x1xi32, #blocked4> 2026-02-21T10:44:55.0628209Z %103 = ttg.convert_layout %13 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T10:44:55.0628547Z %104 = tt.expand_dims %103 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9> 2026-02-21T10:44:55.0628841Z %105 = ttg.convert_layout %104 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked11> 2026-02-21T10:44:55.0629143Z %106 = ttg.convert_layout %105 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>> 2026-02-21T10:44:55.0629503Z %107 
= tt.expand_dims %106 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x128xi32, #blocked12> 2026-02-21T10:44:55.0629810Z %108 = ttg.convert_layout %107 : tensor<1x1x128xi32, #blocked12> -> tensor<1x1x128xi32, #blocked3> 2026-02-21T10:44:55.0630063Z %109 = tt.broadcast %102 : tensor<1x4x1xi32, #blocked4> -> tensor<1x4x128xi32, #blocked4> 2026-02-21T10:44:55.0630310Z %110 = ttg.convert_layout %109 : tensor<1x4x128xi32, #blocked4> -> tensor<1x4x128xi32, #blocked3> 2026-02-21T10:44:55.0630564Z %111 = tt.broadcast %108 : tensor<1x1x128xi32, #blocked3> -> tensor<1x4x128xi32, #blocked3> 2026-02-21T10:44:55.0630771Z %112 = arith.addi %110, %111 : tensor<1x4x128xi32, #blocked3> 2026-02-21T10:44:55.0630963Z %113 = tt.splat %arg3 : !tt.ptr -> tensor<1x4x128x!tt.ptr, #blocked3> 2026-02-21T10:44:55.0631207Z %114 = tt.addptr %113, %112 : tensor<1x4x128x!tt.ptr, #blocked3>, tensor<1x4x128xi32, #blocked3> 2026-02-21T10:44:55.0631441Z tt.store %114, %92 : tensor<1x4x128x!tt.ptr, #blocked3> 2026-02-21T10:44:55.0631587Z tt.return 2026-02-21T10:44:55.0631671Z } 2026-02-21T10:44:55.0631758Z } 2026-02-21T10:44:55.0631802Z 2026-02-21T10:44:55.0631840Z {-# 2026-02-21T10:44:55.0631923Z external_resources: { 2026-02-21T10:44:55.0632030Z mlir_reproducer: { 2026-02-21T10:44:55.0634289Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T10:44:55.0636612Z disable_threading: false, 2026-02-21T10:44:55.0636720Z verify_each: true 2026-02-21T10:44:55.0636818Z } 2026-02-21T10:44:55.0636898Z } 2026-02-21T10:44:55.0636970Z #-} 2026-02-21T10:44:55.0637253Z /tmp/torchinductor_root/xd/cxd2hduf2l55d74tltebjgb4u7ej4fcfh4zszguf5nurbyjr3hpx.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:44:55.0637968Z /tmp/torchinductor_root/xd/cxd2hduf2l55d74tltebjgb4u7ej4fcfh4zszguf5nurbyjr3hpx.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:44:55.0638539Z [20s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T10:44:55.0639271Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 4, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T10:44:55.0639938Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:44:55.0640114Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T10:44:55.1159372Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:62:24: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T10:44:55.1160813Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 128], [16384, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T10:44:55.1161685Z ^ 2026-02-21T10:44:55.1163459Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:64:145: note: - use: %105 = "tt.reshape"(<>) : (tensor<1x128x16xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x16xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>> 2026-02-21T10:44:55.1165144Z 2026-02-21T10:44:55.1166077Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T10:44:55.1167336Z ^ 2026-02-21T10:44:55.1167926Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T10:44:55.1168656Z #blocked = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.1169536Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.1170456Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.1171328Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.1172180Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T10:44:55.1172791Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T10:44:55.1173154Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}> 2026-02-21T10:44:55.1173471Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}> 2026-02-21T10:44:55.1173765Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}> 2026-02-21T10:44:55.1174064Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T10:44:55.1174383Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}> 2026-02-21T10:44:55.1174711Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.1175057Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T10:44:55.1175375Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 
1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.1175693Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.1176010Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:55.1176332Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:55.1176668Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T10:44:55.1177196Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T10:44:55.1177604Z %c192_i64 = arith.constant 192 : i64 2026-02-21T10:44:55.1177730Z %c0_i64 = arith.constant 0 : i64 2026-02-21T10:44:55.1177863Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T10:44:55.1177990Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T10:44:55.1178162Z %cst = arith.constant dense<0.000000e+00> : tensor<1x128x16xbf16, #blocked> 2026-02-21T10:44:55.1178387Z %cst_0 = arith.constant dense<0> : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.1178574Z %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.1178761Z %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.1178941Z %cst_3 = arith.constant dense<128> : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.1179103Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T10:44:55.1179224Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T10:44:55.1179370Z %c24576_i32 = arith.constant 24576 : i32 2026-02-21T10:44:55.1179524Z %cst_4 = 
arith.constant dense<128> : tensor<1x16x1xi32, #blocked3> 2026-02-21T10:44:55.1179713Z %cst_5 = arith.constant dense<0.127517432> : tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1179915Z %cst_6 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1180110Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<1x16xf32, #blocked5> 2026-02-21T10:44:55.1180281Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:44:55.1180459Z %cst_8 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked6> 2026-02-21T10:44:55.1180666Z %cst_9 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1180867Z %cst_10 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1181028Z %c16_i32 = arith.constant 16 : i32 2026-02-21T10:44:55.1181150Z %c128_i32 = arith.constant 128 : i32 2026-02-21T10:44:55.1181274Z %0 = tt.get_program_id x : i32 2026-02-21T10:44:55.1181439Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7> 2026-02-21T10:44:55.1181709Z %2 = ttg.convert_layout %1 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:55.1182040Z %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8> 2026-02-21T10:44:55.1182334Z %4 = ttg.convert_layout %3 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked9> 2026-02-21T10:44:55.1182618Z %5 = ttg.convert_layout %4 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T10:44:55.1182980Z %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<1x1x128xi32, #blocked10> 2026-02-21T10:44:55.1183285Z %7 = ttg.convert_layout %6 : tensor<1x1x128xi32, #blocked10> -> tensor<1x1x128xi32, #blocked6> 2026-02-21T10:44:55.1183521Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<1x1x128x!tt.ptr, 
#blocked6> 2026-02-21T10:44:55.1183734Z %9 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked7> 2026-02-21T10:44:55.1183940Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x16x!tt.ptr, #blocked> 2026-02-21T10:44:55.1184157Z %11 = arith.extsi %1 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7> 2026-02-21T10:44:55.1184433Z %12 = ttg.convert_layout %11 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:55.1184764Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8> 2026-02-21T10:44:55.1185062Z %14 = ttg.convert_layout %13 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked9> 2026-02-21T10:44:55.1185350Z %15 = ttg.convert_layout %14 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> 2026-02-21T10:44:55.1185695Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x128x1xi64, #blocked11> 2026-02-21T10:44:55.1186001Z %17 = ttg.convert_layout %16 : tensor<1x128x1xi64, #blocked11> -> tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.1186254Z %18 = tt.broadcast %17 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x16xi64, #blocked2> 2026-02-21T10:44:55.1186520Z %19 = ttg.convert_layout %18 : tensor<1x128x16xi64, #blocked2> -> tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.1186750Z %20 = arith.extsi %9 : tensor<16xi32, #blocked7> to tensor<16xi64, #blocked7> 2026-02-21T10:44:55.1186945Z %21 = arith.cmpi sge, %17, %cst_2 : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.1187124Z %22 = arith.cmpi slt, %17, %cst_1 : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:55.1187291Z %23 = arith.andi %21, %22 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:55.1187510Z %24 = tt.broadcast %7 : tensor<1x1x128xi32, #blocked6> -> tensor<1x16x128xi32, #blocked6> 2026-02-21T10:44:55.1187760Z %25 = 
ttg.convert_layout %24 : tensor<1x16x128xi32, #blocked6> -> tensor<1x16x128xi32, #blocked12> 2026-02-21T10:44:55.1187999Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<1x16x128x!tt.ptr, #blocked12> 2026-02-21T10:44:55.1188202Z %27 = tt.splat %arg3 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked6> 2026-02-21T10:44:55.1188411Z scf.for %arg4 = %0 to %c24576_i32 step %c4864_i32 : i32 { 2026-02-21T10:44:55.1188564Z %28 = arith.divsi %arg4, %c3072_i32 : i32 2026-02-21T10:44:55.1188690Z %29 = arith.muli %28, %c16_i32 : i32 2026-02-21T10:44:55.1188811Z %30 = arith.subi %c128_i32, %29 : i32 2026-02-21T10:44:55.1188928Z %31 = arith.minsi %30, %c16_i32 : i32 2026-02-21T10:44:55.1189051Z %32 = arith.remsi %arg4, %c3072_i32 : i32 2026-02-21T10:44:55.1189170Z %33 = arith.remsi %32, %31 : i32 2026-02-21T10:44:55.1189284Z %34 = arith.addi %29, %33 : i32 2026-02-21T10:44:55.1189395Z %35 = arith.divsi %32, %31 : i32 2026-02-21T10:44:55.1189507Z %36 = arith.muli %35, %c16384_i32 : i32 2026-02-21T10:44:55.1189625Z %37 = arith.muli %34, %c128_i32 : i32 2026-02-21T10:44:55.1189736Z %38 = arith.addi %36, %37 : i32 2026-02-21T10:44:55.1189871Z %39 = tt.splat %38 : i32 -> tensor<1x1x128xi32, #blocked6> 2026-02-21T10:44:55.1190029Z %40 = arith.addi %39, %7 : tensor<1x1x128xi32, #blocked6> 2026-02-21T10:44:55.1190241Z %41 = tt.addptr %8, %40 : tensor<1x1x128x!tt.ptr, #blocked6>, tensor<1x1x128xi32, #blocked6> 2026-02-21T10:44:55.1190448Z %42 = tt.load %41 : tensor<1x1x128x!tt.ptr, #blocked6> 2026-02-21T10:44:55.1190588Z %43 = arith.extsi %35 : i32 to i64 2026-02-21T10:44:55.1190728Z %44 = arith.muli %43, %c16384_i64 : i64 2026-02-21T10:44:55.1190865Z %45 = tt.splat %44 : i64 -> tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.1191009Z %46 = arith.cmpi sge, %43, %c0_i64 : i64 2026-02-21T10:44:55.1191135Z %47 = arith.cmpi slt, %43, %c192_i64 : i64 2026-02-21T10:44:55.1191257Z %48 = arith.andi %46, %47 : i1 2026-02-21T10:44:55.1191387Z %49 = tt.splat %48 : i1 -> tensor<1x128x1xi1, #blocked2> 
2026-02-21T10:44:55.1191544Z %50 = arith.andi %49, %23 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:55.1191739Z %51 = tt.broadcast %50 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x16xi1, #blocked2> 2026-02-21T10:44:55.1191984Z %52 = ttg.convert_layout %51 : tensor<1x128x16xi1, #blocked2> -> tensor<1x128x16xi1, #blocked> 2026-02-21T10:44:55.1192226Z %53 = tt.reshape %42 : tensor<1x1x128xbf16, #blocked6> -> tensor<1x128xbf16, #blocked9> 2026-02-21T10:44:55.1192417Z %54 = tt.splat %36 : i32 -> tensor<1x16x1xi32, #blocked3> 2026-02-21T10:44:55.1192774Z %55:3 = scf.for %arg5 = %c0_i32 to %c128_i32 step %c16_i32 iter_args(%arg6 = %cst_10, %arg7 = %cst_9, %arg8 = %cst_8) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked6>) : i32 { 2026-02-21T10:44:55.1193130Z %64 = tt.splat %arg5 : i32 -> tensor<16xi32, #blocked7> 2026-02-21T10:44:55.1193284Z %65 = arith.addi %64, %9 : tensor<16xi32, #blocked7> 2026-02-21T10:44:55.1193421Z %66 = arith.extsi %arg5 : i32 to i64 2026-02-21T10:44:55.1193559Z %67 = tt.splat %66 : i64 -> tensor<16xi64, #blocked7> 2026-02-21T10:44:55.1193714Z %68 = arith.addi %67, %20 : tensor<16xi64, #blocked7> 2026-02-21T10:44:55.1193967Z %69 = ttg.convert_layout %68 : tensor<16xi64, #blocked7> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:55.1194292Z %70 = tt.expand_dims %69 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x16xi64, #blocked8> 2026-02-21T10:44:55.1194582Z %71 = ttg.convert_layout %70 : tensor<1x16xi64, #blocked8> -> tensor<1x16xi64, #blocked5> 2026-02-21T10:44:55.1194883Z %72 = ttg.convert_layout %71 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked13}>> 2026-02-21T10:44:55.1195225Z %73 = tt.expand_dims %72 {axis = 1 : i32} : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked13}>> -> tensor<1x1x16xi64, #blocked13> 2026-02-21T10:44:55.1195534Z %74 = ttg.convert_layout %73 : 
tensor<1x1x16xi64, #blocked13> -> tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.1195746Z %75 = arith.muli %74, %cst_3 : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.1195968Z %76 = tt.broadcast %75 : tensor<1x1x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked1> 2026-02-21T10:44:55.1196214Z %77 = ttg.convert_layout %76 : tensor<1x128x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.1196425Z %78 = arith.addi %19, %77 : tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.1196588Z %79 = arith.addi %45, %78 : tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.1196797Z %80 = tt.addptr %10, %79 : tensor<1x128x16x!tt.ptr, #blocked>, tensor<1x128x16xi64, #blocked> 2026-02-21T10:44:55.1197022Z %81 = arith.cmpi sge, %74, %cst_0 : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.1197199Z %82 = arith.cmpi slt, %74, %cst_3 : tensor<1x1x16xi64, #blocked1> 2026-02-21T10:44:55.1197367Z %83 = arith.andi %81, %82 : tensor<1x1x16xi1, #blocked1> 2026-02-21T10:44:55.1197563Z %84 = tt.broadcast %83 : tensor<1x1x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked1> 2026-02-21T10:44:55.1197815Z %85 = ttg.convert_layout %84 : tensor<1x128x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked> 2026-02-21T10:44:55.1198024Z %86 = arith.andi %52, %85 : tensor<1x128x16xi1, #blocked> 2026-02-21T10:44:55.1198189Z %87 = tt.load %80, %86, %cst : tensor<1x128x16x!tt.ptr, #blocked> 2026-02-21T10:44:55.1198417Z %88 = tt.reshape %87 : tensor<1x128x16xbf16, #blocked> -> tensor<128x16xbf16, #blocked5> 2026-02-21T10:44:55.1198708Z %89 = ttg.convert_layout %53 : tensor<1x128xbf16, #blocked9> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> 2026-02-21T10:44:55.1199062Z %90 = ttg.convert_layout %88 : tensor<128x16xbf16, #blocked5> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> 2026-02-21T10:44:55.1199364Z %91 = ttg.convert_layout %cst_7 : tensor<1x16xf32, #blocked5> -> tensor<1x16xf32, #blocked5> 2026-02-21T10:44:55.1199767Z %92 = tt.dot 
%89, %90, %91, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x16xf32, #blocked5> 2026-02-21T10:44:55.1200153Z %93 = tt.reshape %92 : tensor<1x16xf32, #blocked5> -> tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1200390Z %94 = arith.truncf %93 : tensor<1x1x16xf32, #blocked1> to tensor<1x1x16xbf16, #blocked1> 2026-02-21T10:44:55.1200620Z %95 = arith.extf %94 : tensor<1x1x16xbf16, #blocked1> to tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1200809Z %96 = "tt.reduce"(%95) <{axis = 2 : i32}> ({ 2026-02-21T10:44:55.1200936Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T10:44:55.1201062Z %151 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T10:44:55.1201190Z tt.reduce.return %151 : f32 2026-02-21T10:44:55.1201381Z }) : (tensor<1x1x16xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T10:44:55.1201690Z %97 = ttg.convert_layout %96 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1201953Z %98 = arith.truncf %97 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T10:44:55.1202173Z %99 = arith.extf %98 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1202364Z %100 = arith.mulf %99, %cst_6 : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1202558Z %101 = arith.truncf %100 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T10:44:55.1202856Z %102 = arith.extf %101 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1203053Z %103 = arith.cmpf ogt, %arg6, %102 : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1203230Z %104 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1203395Z %105 = arith.ori %103, %104 : tensor<1x1xi1, #blocked4> 2026-02-21T10:44:55.1203612Z %106 = arith.select %105, %arg6, %102 : tensor<1x1xi1, #blocked4>, 
tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1203846Z %107 = arith.mulf %95, %cst_5 : tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1204053Z %108 = arith.truncf %107 : tensor<1x1x16xf32, #blocked1> to tensor<1x1x16xbf16, #blocked1> 2026-02-21T10:44:55.1204342Z %109 = ttg.convert_layout %106 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T10:44:55.1204685Z %110 = tt.expand_dims %109 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T10:44:55.1204990Z %111 = ttg.convert_layout %110 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T10:44:55.1205237Z %112 = arith.extf %108 : tensor<1x1x16xbf16, #blocked1> to tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1205472Z %113 = tt.broadcast %111 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x16xf32, #blocked15> 2026-02-21T10:44:55.1205729Z %114 = ttg.convert_layout %113 : tensor<1x1x16xf32, #blocked15> -> tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1205943Z %115 = arith.subf %112, %114 : tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1206263Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x16xf32, #blocked1>) -> tensor<1x1x16xf32, #blocked1> 2026-02-21T10:44:55.1206554Z %117 = "tt.reduce"(%116) <{axis = 2 : i32}> ({ 2026-02-21T10:44:55.1206682Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T10:44:55.1206804Z %151 = arith.addf %arg9, %arg10 : f32 2026-02-21T10:44:55.1206926Z tt.reduce.return %151 : f32 2026-02-21T10:44:55.1207117Z }) : (tensor<1x1x16xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T10:44:55.1207413Z %118 = ttg.convert_layout %117 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1207660Z %119 = arith.subf %arg6, %106 : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1207948Z %120 = 
tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1208238Z %121 = arith.mulf %arg7, %120 : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1208397Z %122 = arith.addf %121, %118 : tensor<1x1xf32, #blocked4> 2026-02-21T10:44:55.1208645Z %123 = ttg.convert_layout %120 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T10:44:55.1208985Z %124 = tt.expand_dims %123 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T10:44:55.1209288Z %125 = ttg.convert_layout %124 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T10:44:55.1209560Z %126 = tt.broadcast %125 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15> 2026-02-21T10:44:55.1209815Z %127 = ttg.convert_layout %126 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked6> 2026-02-21T10:44:55.1210052Z %128 = arith.mulf %arg8, %127 : tensor<1x1x128xf32, #blocked6> 2026-02-21T10:44:55.1210294Z %129 = ttg.convert_layout %65 : tensor<16xi32, #blocked7> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:55.1210623Z %130 = tt.expand_dims %129 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x16xi32, #blocked8> 2026-02-21T10:44:55.1210934Z %131 = ttg.convert_layout %130 : tensor<1x16xi32, #blocked8> -> tensor<1x16xi32, #blocked5> 2026-02-21T10:44:55.1211224Z %132 = ttg.convert_layout %131 : tensor<1x16xi32, #blocked5> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked16}>> 2026-02-21T10:44:55.1211592Z %133 = tt.expand_dims %132 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x16x1xi32, #blocked16> 2026-02-21T10:44:55.1211899Z %134 = ttg.convert_layout %133 : tensor<1x16x1xi32, #blocked16> -> tensor<1x16x1xi32, 
#blocked3> 2026-02-21T10:44:55.1212114Z %135 = arith.muli %134, %cst_4 : tensor<1x16x1xi32, #blocked3> 2026-02-21T10:44:55.1212282Z %136 = arith.addi %54, %135 : tensor<1x16x1xi32, #blocked3> 2026-02-21T10:44:55.1212482Z %137 = tt.broadcast %136 : tensor<1x16x1xi32, #blocked3> -> tensor<1x16x128xi32, #blocked3> 2026-02-21T10:44:55.1212745Z %138 = ttg.convert_layout %137 : tensor<1x16x128xi32, #blocked3> -> tensor<1x16x128xi32, #blocked12> 2026-02-21T10:44:55.1212965Z %139 = arith.addi %138, %25 : tensor<1x16x128xi32, #blocked12> 2026-02-21T10:44:55.1213190Z %140 = tt.addptr %26, %139 : tensor<1x16x128x!tt.ptr, #blocked12>, tensor<1x16x128xi32, #blocked12> 2026-02-21T10:44:55.1213421Z %141 = tt.load %140 : tensor<1x16x128x!tt.ptr, #blocked12> 2026-02-21T10:44:55.1213629Z %142 = arith.truncf %116 : tensor<1x1x16xf32, #blocked1> to tensor<1x1x16xbf16, #blocked1> 2026-02-21T10:44:55.1213872Z %143 = tt.reshape %128 : tensor<1x1x128xf32, #blocked6> -> tensor<1x128xf32, #blocked9> 2026-02-21T10:44:55.1214102Z %144 = tt.reshape %142 : tensor<1x1x16xbf16, #blocked1> -> tensor<1x16xbf16, #blocked5> 2026-02-21T10:44:55.1214361Z %145 = tt.reshape %141 : tensor<1x16x128xbf16, #blocked12> -> tensor<16x128xbf16, #blocked9> 2026-02-21T10:44:55.1214663Z %146 = ttg.convert_layout %144 : tensor<1x16xbf16, #blocked5> -> tensor<1x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> 2026-02-21T10:44:55.1215015Z %147 = ttg.convert_layout %145 : tensor<16x128xbf16, #blocked9> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> 2026-02-21T10:44:55.1215317Z %148 = ttg.convert_layout %143 : tensor<1x128xf32, #blocked9> -> tensor<1x128xf32, #blocked9> 2026-02-21T10:44:55.1215731Z %149 = tt.dot %146, %147, %148, inputPrecision = tf32 : tensor<1x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> -> tensor<1x128xf32, #blocked9> 2026-02-21T10:44:55.1216128Z %150 = tt.reshape %149 : tensor<1x128xf32, 
#blocked9> -> tensor<1x1x128xf32, #blocked6> 2026-02-21T10:44:55.1216408Z scf.yield %106, %122, %150 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked6> 2026-02-21T10:44:55.1216650Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T10:44:55.1216903Z %56 = ttg.convert_layout %55#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T10:44:55.1217237Z %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T10:44:55.1217536Z %58 = ttg.convert_layout %57 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T10:44:55.1217803Z %59 = tt.broadcast %58 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15> 2026-02-21T10:44:55.1218049Z %60 = ttg.convert_layout %59 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked6> 2026-02-21T10:44:55.1218262Z %61 = arith.divf %55#2, %60 : tensor<1x1x128xf32, #blocked6> 2026-02-21T10:44:55.1218464Z %62 = arith.truncf %61 : tensor<1x1x128xf32, #blocked6> to tensor<1x1x128xbf16, #blocked6> 2026-02-21T10:44:55.1218731Z %63 = tt.addptr %27, %40 : tensor<1x1x128x!tt.ptr, #blocked6>, tensor<1x1x128xi32, #blocked6> 2026-02-21T10:44:55.1220794Z tt.store %63, %62 : tensor<1x1x128x!tt.ptr, #blocked6> 2026-02-21T10:44:55.1220965Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T10:44:55.1221107Z tt.return 2026-02-21T10:44:55.1221186Z } 2026-02-21T10:44:55.1221259Z } 2026-02-21T10:44:55.1221301Z 2026-02-21T10:44:55.1221334Z {-# 2026-02-21T10:44:55.1221417Z external_resources: { 2026-02-21T10:44:55.1221515Z mlir_reproducer: { 2026-02-21T10:44:55.1223782Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, 
tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=2 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=2}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T10:44:55.1226058Z disable_threading: false, 2026-02-21T10:44:55.1226164Z verify_each: true 2026-02-21T10:44:55.1226257Z } 2026-02-21T10:44:55.1226327Z } 2026-02-21T10:44:55.1226405Z #-} 2026-02-21T10:44:55.1226696Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:44:55.1227379Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:44:55.1227946Z [20s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T10:44:55.1228754Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[4, 4], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T10:44:55.1229493Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:44:55.1229663Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T10:44:56.7216073Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:57:20: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T10:44:56.7217884Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 128], [16384, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T10:44:56.7218756Z ^ 2026-02-21T10:44:56.7220358Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:59:141: note: - use: %124 = "tt.reshape"(<>) : (tensor<1x128x32xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x32xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>> 2026-02-21T10:44:56.7221626Z 2026-02-21T10:44:56.7222295Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T10:44:56.7223192Z ^ 2026-02-21T10:44:56.7223641Z LLVM ERROR: operation destroyed but still has uses 
2026-02-21T10:44:56.7224105Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}> 2026-02-21T10:44:56.7224733Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:56.7225351Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}> 2026-02-21T10:44:56.7225966Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}> 2026-02-21T10:44:56.7226587Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:56.7227185Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T10:44:56.7227880Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T10:44:56.7228434Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}> 2026-02-21T10:44:56.7228987Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}> 2026-02-21T10:44:56.7229565Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T10:44:56.7230076Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}> 2026-02-21T10:44:56.7230544Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}> 2026-02-21T10:44:56.7230991Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T10:44:56.7231439Z #blocked13 = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:56.7231887Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:56.7232331Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:44:56.7232773Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:44:56.7233252Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T10:44:56.7234004Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T10:44:56.7234567Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T10:44:56.7234747Z %c192_i64 = arith.constant 192 : i64 2026-02-21T10:44:56.7234941Z %c0_i64 = arith.constant 0 : i64 2026-02-21T10:44:56.7235108Z %c128_i64 = arith.constant 128 : i64 2026-02-21T10:44:56.7235311Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T10:44:56.7235547Z %cst = arith.constant dense<0.000000e+00> : tensor<1x128x32xbf16, #blocked> 2026-02-21T10:44:56.7235828Z %cst_0 = arith.constant dense<0> : tensor<1x1x32xi64, #blocked1> 2026-02-21T10:44:56.7236088Z %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:56.7236344Z %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:56.7236600Z %cst_3 = arith.constant dense<128> : tensor<1x1x32xi64, #blocked1> 2026-02-21T10:44:56.7236875Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x1x128xbf16, #blocked3> 2026-02-21T10:44:56.7237149Z %cst_5 = 
arith.constant dense<128> : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7237405Z %cst_6 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7237617Z %c32_i32 = arith.constant 32 : i32 2026-02-21T10:44:56.7237786Z %c128_i32 = arith.constant 128 : i32 2026-02-21T10:44:56.7237952Z %c512_i32 = arith.constant 512 : i32 2026-02-21T10:44:56.7238157Z %cst_7 = arith.constant dense<128> : tensor<1x32x1xi32, #blocked4> 2026-02-21T10:44:56.7238426Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7238703Z %cst_9 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7238981Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x32xf32, #blocked6> 2026-02-21T10:44:56.7239236Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:44:56.7239461Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked3> 2026-02-21T10:44:56.7239771Z %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7240047Z %cst_13 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7240258Z %c4_i32 = arith.constant 4 : i32 2026-02-21T10:44:56.7240392Z %c192_i32 = arith.constant 192 : i32 2026-02-21T10:44:56.7240527Z %0 = tt.get_program_id x : i32 2026-02-21T10:44:56.7240662Z %1 = tt.get_program_id y : i32 2026-02-21T10:44:56.7240793Z %2 = arith.muli %1, %c192_i32 : i32 2026-02-21T10:44:56.7251176Z %3 = arith.addi %0, %2 : i32 2026-02-21T10:44:56.7251322Z %4 = arith.divsi %3, %c512_i32 : i32 2026-02-21T10:44:56.7251453Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T10:44:56.7251569Z %6 = arith.subi %c192_i32, %5 : i32 2026-02-21T10:44:56.7251714Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T10:44:56.7251832Z %8 = arith.remsi %3, %c512_i32 : i32 2026-02-21T10:44:56.7251953Z %9 = arith.remsi %8, %7 : i32 2026-02-21T10:44:56.7252064Z %10 = arith.addi %5, %9 : i32 2026-02-21T10:44:56.7252184Z %11 = arith.divsi %8, %7 : 
i32 2026-02-21T10:44:56.7252346Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7> 2026-02-21T10:44:56.7252528Z %13 = arith.extsi %10 : i32 to i64 2026-02-21T10:44:56.7252646Z %14 = arith.extsi %11 : i32 to i64 2026-02-21T10:44:56.7252810Z %15 = tt.splat %arg0 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked3> 2026-02-21T10:44:56.7252989Z %16 = arith.muli %13, %c16384_i64 : i64 2026-02-21T10:44:56.7253135Z %17 = tt.splat %16 : i64 -> tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7253281Z %18 = arith.muli %14, %c128_i64 : i64 2026-02-21T10:44:56.7253468Z %19 = tt.splat %18 : i64 -> tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7253657Z %20 = arith.extsi %12 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7> 2026-02-21T10:44:56.7253934Z %21 = ttg.convert_layout %20 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:56.7254273Z %22 = tt.expand_dims %21 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8> 2026-02-21T10:44:56.7254596Z %23 = ttg.convert_layout %22 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked9> 2026-02-21T10:44:56.7254907Z %24 = ttg.convert_layout %23 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T10:44:56.7255258Z %25 = tt.expand_dims %24 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<1x1x128xi64, #blocked10> 2026-02-21T10:44:56.7255569Z %26 = ttg.convert_layout %25 : tensor<1x1x128xi64, #blocked10> -> tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7255780Z %27 = arith.addi %19, %26 : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7255944Z %28 = arith.addi %17, %27 : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7256151Z %29 = tt.addptr %15, %28 : tensor<1x1x128x!tt.ptr, #blocked3>, tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7256351Z %30 = arith.cmpi sge, 
%13, %c0_i64 : i64 2026-02-21T10:44:56.7256487Z %31 = arith.cmpi slt, %13, %c192_i64 : i64 2026-02-21T10:44:56.7256611Z %32 = arith.andi %30, %31 : i1 2026-02-21T10:44:56.7256732Z %33 = arith.cmpi sge, %14, %c0_i64 : i64 2026-02-21T10:44:56.7256858Z %34 = arith.cmpi slt, %14, %c128_i64 : i64 2026-02-21T10:44:56.7256982Z %35 = arith.andi %33, %34 : i1 2026-02-21T10:44:56.7257094Z %36 = arith.andi %32, %35 : i1 2026-02-21T10:44:56.7257231Z %37 = tt.splat %36 : i1 -> tensor<1x1x128xi1, #blocked3> 2026-02-21T10:44:56.7257399Z %38 = arith.cmpi sge, %26, %cst_6 : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7257579Z %39 = arith.cmpi slt, %26, %cst_5 : tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7257753Z %40 = arith.andi %38, %39 : tensor<1x1x128xi1, #blocked3> 2026-02-21T10:44:56.7257912Z %41 = arith.andi %37, %40 : tensor<1x1x128xi1, #blocked3> 2026-02-21T10:44:56.7258115Z %42 = tt.load %29, %41, %cst_4 : tensor<1x1x128x!tt.ptr, #blocked3> 2026-02-21T10:44:56.7258315Z %43 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked7> 2026-02-21T10:44:56.7258535Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x32x!tt.ptr, #blocked> 2026-02-21T10:44:56.7258729Z %45 = tt.splat %16 : i64 -> tensor<1x128x32xi64, #blocked> 2026-02-21T10:44:56.7258978Z %46 = ttg.convert_layout %23 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> 2026-02-21T10:44:56.7259331Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x128x1xi64, #blocked11> 2026-02-21T10:44:56.7259641Z %48 = ttg.convert_layout %47 : tensor<1x128x1xi64, #blocked11> -> tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:56.7259898Z %49 = tt.broadcast %48 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x32xi64, #blocked2> 2026-02-21T10:44:56.7260157Z %50 = ttg.convert_layout %49 : tensor<1x128x32xi64, #blocked2> -> tensor<1x128x32xi64, #blocked> 2026-02-21T10:44:56.7260391Z %51 = 
arith.extsi %43 : tensor<32xi32, #blocked7> to tensor<32xi64, #blocked7> 2026-02-21T10:44:56.7260592Z %52 = arith.cmpi sge, %48, %cst_2 : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:56.7260767Z %53 = arith.cmpi slt, %48, %cst_1 : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:44:56.7260942Z %54 = arith.andi %52, %53 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:56.7261103Z %55 = tt.splat %32 : i1 -> tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:56.7261283Z %56 = arith.andi %55, %54 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:44:56.7261485Z %57 = tt.broadcast %56 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x32xi1, #blocked2> 2026-02-21T10:44:56.7261732Z %58 = ttg.convert_layout %57 : tensor<1x128x32xi1, #blocked2> -> tensor<1x128x32xi1, #blocked> 2026-02-21T10:44:56.7261990Z %59 = tt.reshape %42 : tensor<1x1x128xbf16, #blocked3> -> tensor<1x128xbf16, #blocked9> 2026-02-21T10:44:56.7262175Z %60 = arith.muli %10, %c16384_i32 : i32 2026-02-21T10:44:56.7262347Z %61 = tt.splat %60 : i32 -> tensor<1x32x1xi32, #blocked4> 2026-02-21T10:44:56.7262591Z %62 = ttg.convert_layout %12 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:56.7262936Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8> 2026-02-21T10:44:56.7263237Z %64 = ttg.convert_layout %63 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked9> 2026-02-21T10:44:56.7263529Z %65 = ttg.convert_layout %64 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>> 2026-02-21T10:44:56.7263883Z %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<1x1x128xi32, #blocked10> 2026-02-21T10:44:56.7264196Z %67 = ttg.convert_layout %66 : tensor<1x1x128xi32, #blocked10> -> tensor<1x1x128xi32, #blocked3> 2026-02-21T10:44:56.7264450Z %68 = tt.broadcast %67 : tensor<1x1x128xi32, 
#blocked3> -> tensor<1x32x128xi32, #blocked3> 2026-02-21T10:44:56.7264711Z %69 = ttg.convert_layout %68 : tensor<1x32x128xi32, #blocked3> -> tensor<1x32x128xi32, #blocked12> 2026-02-21T10:44:56.7264954Z %70 = tt.splat %arg2 : !tt.ptr -> tensor<1x32x128x!tt.ptr, #blocked12> 2026-02-21T10:44:56.7265344Z %71:3 = scf.for %arg4 = %c0_i32 to %c128_i32 step %c32_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x1xf32, #blocked5>, tensor<1x1xf32, #blocked5>, tensor<1x1x128xf32, #blocked3>) : i32 { 2026-02-21T10:44:56.7265702Z %81 = tt.splat %arg4 : i32 -> tensor<32xi32, #blocked7> 2026-02-21T10:44:56.7265863Z %82 = arith.addi %81, %43 : tensor<32xi32, #blocked7> 2026-02-21T10:44:56.7266013Z %83 = arith.extsi %arg4 : i32 to i64 2026-02-21T10:44:56.7266170Z %84 = tt.splat %83 : i64 -> tensor<32xi64, #blocked7> 2026-02-21T10:44:56.7266326Z %85 = arith.addi %84, %51 : tensor<32xi64, #blocked7> 2026-02-21T10:44:56.7266571Z %86 = ttg.convert_layout %85 : tensor<32xi64, #blocked7> -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:56.7266904Z %87 = tt.expand_dims %86 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi64, #blocked8> 2026-02-21T10:44:56.7267200Z %88 = ttg.convert_layout %87 : tensor<1x32xi64, #blocked8> -> tensor<1x32xi64, #blocked6> 2026-02-21T10:44:56.7267489Z %89 = ttg.convert_layout %88 : tensor<1x32xi64, #blocked6> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked13}>> 2026-02-21T10:44:56.7267836Z %90 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked13}>> -> tensor<1x1x32xi64, #blocked13> 2026-02-21T10:44:56.7268149Z %91 = ttg.convert_layout %90 : tensor<1x1x32xi64, #blocked13> -> tensor<1x1x32xi64, #blocked1> 2026-02-21T10:44:56.7268367Z %92 = arith.muli %91, %cst_3 : tensor<1x1x32xi64, #blocked1> 2026-02-21T10:44:56.7268579Z %93 = tt.broadcast %92 : tensor<1x1x32xi64, #blocked1> -> tensor<1x128x32xi64, 
#blocked1> 2026-02-21T10:44:56.7268829Z %94 = ttg.convert_layout %93 : tensor<1x128x32xi64, #blocked1> -> tensor<1x128x32xi64, #blocked> 2026-02-21T10:44:56.7269049Z %95 = arith.addi %50, %94 : tensor<1x128x32xi64, #blocked> 2026-02-21T10:44:56.7269214Z %96 = arith.addi %45, %95 : tensor<1x128x32xi64, #blocked> 2026-02-21T10:44:56.7269425Z %97 = tt.addptr %44, %96 : tensor<1x128x32x!tt.ptr, #blocked>, tensor<1x128x32xi64, #blocked> 2026-02-21T10:44:56.7269684Z %98 = arith.cmpi sge, %91, %cst_0 : tensor<1x1x32xi64, #blocked1> 2026-02-21T10:44:56.7269867Z %99 = arith.cmpi slt, %91, %cst_3 : tensor<1x1x32xi64, #blocked1> 2026-02-21T10:44:56.7270041Z %100 = arith.andi %98, %99 : tensor<1x1x32xi1, #blocked1> 2026-02-21T10:44:56.7270253Z %101 = tt.broadcast %100 : tensor<1x1x32xi1, #blocked1> -> tensor<1x128x32xi1, #blocked1> 2026-02-21T10:44:56.7270506Z %102 = ttg.convert_layout %101 : tensor<1x128x32xi1, #blocked1> -> tensor<1x128x32xi1, #blocked> 2026-02-21T10:44:56.7270743Z %103 = arith.andi %58, %102 : tensor<1x128x32xi1, #blocked> 2026-02-21T10:44:56.7270947Z %104 = tt.load %97, %103, %cst : tensor<1x128x32x!tt.ptr, #blocked> 2026-02-21T10:44:56.7271169Z %105 = tt.reshape %104 : tensor<1x128x32xbf16, #blocked> -> tensor<128x32xbf16, #blocked6> 2026-02-21T10:44:56.7271477Z %106 = ttg.convert_layout %59 : tensor<1x128xbf16, #blocked9> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> 2026-02-21T10:44:56.7271836Z %107 = ttg.convert_layout %105 : tensor<128x32xbf16, #blocked6> -> tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T10:44:56.7272149Z %108 = ttg.convert_layout %cst_10 : tensor<1x32xf32, #blocked6> -> tensor<1x32xf32, #blocked6> 2026-02-21T10:44:56.7272569Z %109 = tt.dot %106, %107, %108, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<1x32xf32, #blocked6> 2026-02-21T10:44:56.7272970Z %110 = 
tt.reshape %109 : tensor<1x32xf32, #blocked6> -> tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7273216Z %111 = arith.truncf %110 : tensor<1x1x32xf32, #blocked1> to tensor<1x1x32xbf16, #blocked1> 2026-02-21T10:44:56.7273461Z %112 = arith.extf %111 : tensor<1x1x32xbf16, #blocked1> to tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7273666Z %113 = "tt.reduce"(%112) <{axis = 2 : i32}> ({ 2026-02-21T10:44:56.7273807Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:44:56.7273937Z %168 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T10:44:56.7274073Z tt.reduce.return %168 : f32 2026-02-21T10:44:56.7274266Z }) : (tensor<1x1x32xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T10:44:56.7274589Z %114 = ttg.convert_layout %113 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7274873Z %115 = arith.truncf %114 : tensor<1x1xf32, #blocked5> to tensor<1x1xbf16, #blocked5> 2026-02-21T10:44:56.7275101Z %116 = arith.extf %115 : tensor<1x1xbf16, #blocked5> to tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7275309Z %117 = arith.mulf %116, %cst_9 : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7275506Z %118 = arith.truncf %117 : tensor<1x1xf32, #blocked5> to tensor<1x1xbf16, #blocked5> 2026-02-21T10:44:56.7275733Z %119 = arith.extf %118 : tensor<1x1xbf16, #blocked5> to tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7275933Z %120 = arith.cmpf ogt, %arg5, %119 : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7276119Z %121 = arith.cmpf une, %arg5, %arg5 : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7276294Z %122 = arith.ori %120, %121 : tensor<1x1xi1, #blocked5> 2026-02-21T10:44:56.7276493Z %123 = arith.select %122, %arg5, %119 : tensor<1x1xi1, #blocked5>, tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7276710Z %124 = arith.mulf %112, %cst_8 : tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7276920Z %125 = arith.truncf %124 : tensor<1x1x32xf32, #blocked1> to tensor<1x1x32xbf16, 
#blocked1> 2026-02-21T10:44:56.7277216Z %126 = ttg.convert_layout %123 : tensor<1x1xf32, #blocked5> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T10:44:56.7277567Z %127 = tt.expand_dims %126 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T10:44:56.7277889Z %128 = ttg.convert_layout %127 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T10:44:56.7278148Z %129 = arith.extf %125 : tensor<1x1x32xbf16, #blocked1> to tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7278393Z %130 = tt.broadcast %128 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x32xf32, #blocked15> 2026-02-21T10:44:56.7278650Z %131 = ttg.convert_layout %130 : tensor<1x1x32xf32, #blocked15> -> tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7278890Z %132 = arith.subf %129, %131 : tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7279225Z %133 = tt.extern_elementwise %132 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x32xf32, #blocked1>) -> tensor<1x1x32xf32, #blocked1> 2026-02-21T10:44:56.7279524Z %134 = "tt.reduce"(%133) <{axis = 2 : i32}> ({ 2026-02-21T10:44:56.7279655Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:44:56.7279784Z %168 = arith.addf %arg8, %arg9 : f32 2026-02-21T10:44:56.7279909Z tt.reduce.return %168 : f32 2026-02-21T10:44:56.7280103Z }) : (tensor<1x1x32xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T10:44:56.7280403Z %135 = ttg.convert_layout %134 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7280650Z %136 = arith.subf %arg5, %123 : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7280946Z %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked5>) -> tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7281242Z %138 = arith.mulf %arg6, %137 : 
tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7281405Z %139 = arith.addf %138, %135 : tensor<1x1xf32, #blocked5> 2026-02-21T10:44:56.7281655Z %140 = ttg.convert_layout %137 : tensor<1x1xf32, #blocked5> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T10:44:56.7281998Z %141 = tt.expand_dims %140 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T10:44:56.7282308Z %142 = ttg.convert_layout %141 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T10:44:56.7282641Z %143 = tt.broadcast %142 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15> 2026-02-21T10:44:56.7282900Z %144 = ttg.convert_layout %143 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked3> 2026-02-21T10:44:56.7283127Z %145 = arith.mulf %arg7, %144 : tensor<1x1x128xf32, #blocked3> 2026-02-21T10:44:56.7283374Z %146 = ttg.convert_layout %82 : tensor<32xi32, #blocked7> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:44:56.7283708Z %147 = tt.expand_dims %146 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi32, #blocked8> 2026-02-21T10:44:56.7284005Z %148 = ttg.convert_layout %147 : tensor<1x32xi32, #blocked8> -> tensor<1x32xi32, #blocked6> 2026-02-21T10:44:56.7284296Z %149 = ttg.convert_layout %148 : tensor<1x32xi32, #blocked6> -> tensor<1x32xi32, #ttg.slice<{dim = 2, parent = #blocked16}>> 2026-02-21T10:44:56.7284648Z %150 = tt.expand_dims %149 {axis = 2 : i32} : tensor<1x32xi32, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x32x1xi32, #blocked16> 2026-02-21T10:44:56.7284957Z %151 = ttg.convert_layout %150 : tensor<1x32x1xi32, #blocked16> -> tensor<1x32x1xi32, #blocked4> 2026-02-21T10:44:56.7285179Z %152 = arith.muli %151, %cst_7 : tensor<1x32x1xi32, #blocked4> 2026-02-21T10:44:56.7285352Z %153 = arith.addi %61, %152 : tensor<1x32x1xi32, #blocked4> 2026-02-21T10:44:56.7285555Z %154 = 
tt.broadcast %153 : tensor<1x32x1xi32, #blocked4> -> tensor<1x32x128xi32, #blocked4> 2026-02-21T10:44:56.7285825Z %155 = ttg.convert_layout %154 : tensor<1x32x128xi32, #blocked4> -> tensor<1x32x128xi32, #blocked12> 2026-02-21T10:44:56.7286069Z %156 = arith.addi %155, %69 : tensor<1x32x128xi32, #blocked12> 2026-02-21T10:44:56.7286299Z %157 = tt.addptr %70, %156 : tensor<1x32x128x!tt.ptr, #blocked12>, tensor<1x32x128xi32, #blocked12> 2026-02-21T10:44:56.7286534Z %158 = tt.load %157 : tensor<1x32x128x!tt.ptr, #blocked12> 2026-02-21T10:44:56.7286747Z %159 = arith.truncf %133 : tensor<1x1x32xf32, #blocked1> to tensor<1x1x32xbf16, #blocked1> 2026-02-21T10:44:56.7286995Z %160 = tt.reshape %145 : tensor<1x1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked9> 2026-02-21T10:44:56.7287249Z %161 = tt.reshape %159 : tensor<1x1x32xbf16, #blocked1> -> tensor<1x32xbf16, #blocked6> 2026-02-21T10:44:56.7287516Z %162 = tt.reshape %158 : tensor<1x32x128xbf16, #blocked12> -> tensor<32x128xbf16, #blocked9> 2026-02-21T10:44:56.7287819Z %163 = ttg.convert_layout %161 : tensor<1x32xbf16, #blocked6> -> tensor<1x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> 2026-02-21T10:44:56.7288182Z %164 = ttg.convert_layout %162 : tensor<32x128xbf16, #blocked9> -> tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> 2026-02-21T10:44:56.7288493Z %165 = ttg.convert_layout %160 : tensor<1x128xf32, #blocked9> -> tensor<1x128xf32, #blocked9> 2026-02-21T10:44:56.7288901Z %166 = tt.dot %163, %164, %165, inputPrecision = tf32 : tensor<1x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> * tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> -> tensor<1x128xf32, #blocked9> 2026-02-21T10:44:56.7289301Z %167 = tt.reshape %166 : tensor<1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked3> 2026-02-21T10:44:56.7289581Z scf.yield %123, %139, %167 : tensor<1x1xf32, #blocked5>, tensor<1x1xf32, #blocked5>, tensor<1x1x128xf32, #blocked3> 2026-02-21T10:44:56.7289838Z } {tt.flatten, 
tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T10:44:56.7290111Z %72 = ttg.convert_layout %71#1 : tensor<1x1xf32, #blocked5> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T10:44:56.7290450Z %73 = tt.expand_dims %72 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T10:44:56.7290754Z %74 = ttg.convert_layout %73 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T10:44:56.7291025Z %75 = tt.broadcast %74 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15> 2026-02-21T10:44:56.7291272Z %76 = ttg.convert_layout %75 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked3> 2026-02-21T10:44:56.7291495Z %77 = arith.divf %71#2, %76 : tensor<1x1x128xf32, #blocked3> 2026-02-21T10:44:56.7291706Z %78 = arith.truncf %77 : tensor<1x1x128xf32, #blocked3> to tensor<1x1x128xbf16, #blocked3> 2026-02-21T10:44:56.7291936Z %79 = tt.splat %arg3 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked3> 2026-02-21T10:44:56.7292177Z %80 = tt.addptr %79, %28 : tensor<1x1x128x!tt.ptr, #blocked3>, tensor<1x1x128xi64, #blocked3> 2026-02-21T10:44:56.7292397Z tt.store %80, %78, %41 : tensor<1x1x128x!tt.ptr, #blocked3> 2026-02-21T10:44:56.7292548Z tt.return 2026-02-21T10:44:56.7292633Z } 2026-02-21T10:44:56.7292724Z } 2026-02-21T10:44:56.7292771Z 2026-02-21T10:44:56.7292809Z {-# 2026-02-21T10:44:56.7292895Z external_resources: { 2026-02-21T10:44:56.7293009Z mlir_reproducer: { 2026-02-21T10:44:56.7295263Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T10:44:56.7297577Z disable_threading: false, 2026-02-21T10:44:56.7297695Z verify_each: true 2026-02-21T10:44:56.7297797Z } 2026-02-21T10:44:56.7297873Z } 2026-02-21T10:44:56.7297955Z #-} 2026-02-21T10:44:56.7298238Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:44:56.7298957Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:44:56.7299518Z [22s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T10:44:56.7300254Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T10:44:56.7300922Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:44:56.7301098Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T10:44:59.0286806Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.5 configs/s 2026-02-21T10:44:59.0293402Z [24s] Adaptive compile timeout: 30s (90% percentile=3.4s, bounds=[30.0s, 60s]) 2026-02-21T10:44:59.9638588Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 991.8 configs/s 2026-02-21T10:45:00.2617261Z [25s] Initial random population of 100, 5 starting points: 2026-02-21T10:45:00.2617438Z error=9 2026-02-21T10:45:00.2617526Z ok=91 2026-02-21T10:45:00.2657857Z min=0.0253 2026-02-21T10:45:00.2657959Z mid=0.1022 2026-02-21T10:45:00.2658048Z max=6.1846 2026-02-21T10:45:00.2658144Z best={'block_sizes': [1, 16, 16], 2026-02-21T10:45:00.2658313Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T10:45:00.2658478Z 'l2_groupings': [1], 2026-02-21T10:45:00.2658594Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:45:00.2658722Z 'loop_orders': [[0, 1]], 2026-02-21T10:45:00.2658835Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:45:00.2658938Z 'num_stages': 1, 2026-02-21T10:45:00.2659031Z 'num_warps': 1, 2026-02-21T10:45:00.2659122Z 'pid_type': 'xyz', 2026-02-21T10:45:00.2659235Z 'range_flattens': [None, False], 2026-02-21T10:45:00.2659355Z 'range_multi_buffers': [None, False], 2026-02-21T10:45:00.2659472Z 'range_num_stages': [0, 2], 
2026-02-21T10:45:00.2659584Z 'range_unroll_factors': [0, 4], 2026-02-21T10:45:00.2659696Z 'range_warp_specializes': [], 2026-02-21T10:45:00.2659802Z 'waves_per_eu': 2} 2026-02-21T10:45:00.2659913Z [25s] Fitting surrogate: 100 points, 100 targets 2026-02-21T10:45:00.9279902Z [26s] Generation 1 starting: 68 neighbors, 5 active search path(s) 2026-02-21T10:45:11.0542516Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 2.7 configs/s 2026-02-21T10:45:15.4097731Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.6 configs/s 2026-02-21T10:45:18.0620516Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.4 2026-02-21T10:45:18.0621096Z configs/s 2026-02-21T10:45:18.4832388Z [44s] Generation 1 complete: 2026-02-21T10:45:18.4832767Z ok=73 2026-02-21T10:45:18.4833008Z min=0.0202 2026-02-21T10:45:18.4833229Z mid=0.0289 2026-02-21T10:45:18.4833429Z max=0.2869 2026-02-21T10:45:18.4834103Z best={'block_sizes': [1, 128, 128], 2026-02-21T10:45:18.4834525Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T10:45:18.4834924Z 'l2_groupings': [1], 2026-02-21T10:45:18.4835342Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:45:18.4835664Z 'loop_orders': [[0, 1]], 2026-02-21T10:45:18.4835948Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:45:18.4836219Z 'num_stages': 1, 2026-02-21T10:45:18.4836457Z 'num_warps': 8, 2026-02-21T10:45:18.4836709Z 'pid_type': 'flat', 2026-02-21T10:45:18.4836965Z 'range_flattens': [None, True], 2026-02-21T10:45:18.4837260Z 'range_multi_buffers': [None, False], 2026-02-21T10:45:18.4837602Z 'range_num_stages': [0, 1], 2026-02-21T10:45:18.4838157Z 'range_unroll_factors': [0, 0], 2026-02-21T10:45:18.4838502Z 'range_warp_specializes': [], 2026-02-21T10:45:18.4838835Z 'waves_per_eu': 4} 2026-02-21T10:45:18.5123885Z [44s] Fitting surrogate: 173 points, 173 targets 2026-02-21T10:45:19.4409112Z [45s] Generation 2 starting: 67 neighbors, 5 active search path(s) 2026-02-21T10:45:28.6248480Z 
Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 5.6 configs/s 2026-02-21T10:45:32.9520259Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 16.0 configs/s 2026-02-21T10:45:36.5054313Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 316.7 2026-02-21T10:45:36.5054925Z configs/s 2026-02-21T10:45:37.0301226Z [62s] Generation 2 complete: 2026-02-21T10:45:37.0301443Z ok=72 2026-02-21T10:45:37.0301533Z min=0.0218 2026-02-21T10:45:37.0301618Z mid=0.0294 2026-02-21T10:45:37.0301703Z max=0.1891 2026-02-21T10:45:37.0301799Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:45:37.0301961Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:45:37.0302119Z 'l2_groupings': [1], 2026-02-21T10:45:37.0302800Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:45:37.0302925Z 'loop_orders': [[0, 1]], 2026-02-21T10:45:37.0303053Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:45:37.0303157Z 'num_stages': 4, 2026-02-21T10:45:37.0303252Z 'num_warps': 4, 2026-02-21T10:45:37.0303348Z 'pid_type': 'flat', 2026-02-21T10:45:37.0303463Z 'range_flattens': [None, True], 2026-02-21T10:45:37.0303586Z 'range_multi_buffers': [None, None], 2026-02-21T10:45:37.0303704Z 'range_num_stages': [0, 4], 2026-02-21T10:45:37.0303816Z 'range_unroll_factors': [0, 2], 2026-02-21T10:45:37.0303930Z 'range_warp_specializes': [], 2026-02-21T10:45:37.0304049Z 'waves_per_eu': 2} 2026-02-21T10:45:37.1345960Z [62s] Fitting surrogate: 245 points, 245 targets 2026-02-21T10:45:37.8664744Z [63s] Generation 3 starting: 55 neighbors, 5 active search path(s) 2026-02-21T10:45:52.0099009Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 1.0 configs/s 2026-02-21T10:45:56.0940053Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 14.4 configs/s 2026-02-21T10:45:58.2165622Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 465.0 2026-02-21T10:45:58.2166803Z configs/s 2026-02-21T10:45:58.6175312Z [84s] Generation 
3 complete: 2026-02-21T10:45:58.6175736Z ok=60 2026-02-21T10:45:58.6175922Z min=0.0198 2026-02-21T10:45:58.6176118Z mid=0.0242 2026-02-21T10:45:58.6176293Z max=2.2780 2026-02-21T10:45:58.6176497Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:45:58.6176859Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:45:58.6177608Z 'l2_groupings': [1], 2026-02-21T10:45:58.6177849Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:45:58.6178124Z 'loop_orders': [[0, 1]], 2026-02-21T10:45:58.6178361Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:45:58.6178592Z 'num_stages': 3, 2026-02-21T10:45:58.6178793Z 'num_warps': 4, 2026-02-21T10:45:58.6178991Z 'pid_type': 'flat', 2026-02-21T10:45:58.6179223Z 'range_flattens': [None, True], 2026-02-21T10:45:58.6179465Z 'range_multi_buffers': [None, None], 2026-02-21T10:45:58.6179738Z 'range_num_stages': [0, 4], 2026-02-21T10:45:58.6179960Z 'range_unroll_factors': [0, 2], 2026-02-21T10:45:58.6180193Z 'range_warp_specializes': [], 2026-02-21T10:45:58.6180414Z 'waves_per_eu': 2} 2026-02-21T10:45:58.6809750Z [84s] Fitting surrogate: 305 points, 305 targets 2026-02-21T10:45:59.4721136Z [85s] Generation 4 starting: 69 neighbors, 5 active search path(s) 2026-02-21T10:46:10.3629234Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 13.0 configs/s 2026-02-21T10:46:14.9170146Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 15.9 configs/s 2026-02-21T10:46:18.1384323Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 308.8 2026-02-21T10:46:18.1384929Z configs/s 2026-02-21T10:46:18.6698366Z [104s] Generation 4 complete: 2026-02-21T10:46:18.6703021Z ok=74 2026-02-21T10:46:18.6703299Z min=0.0200 2026-02-21T10:46:18.6703455Z mid=0.0242 2026-02-21T10:46:18.6703606Z max=0.0730 2026-02-21T10:46:18.6703768Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:46:18.6704094Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T10:46:18.6704399Z 'l2_groupings': [64], 
2026-02-21T10:46:18.6704580Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:46:18.6704786Z 'loop_orders': [[0, 1]], 2026-02-21T10:46:18.6704965Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:46:18.6705139Z 'num_stages': 3, 2026-02-21T10:46:18.6705297Z 'num_warps': 4, 2026-02-21T10:46:18.6705450Z 'pid_type': 'flat', 2026-02-21T10:46:18.6705625Z 'range_flattens': [None, True], 2026-02-21T10:46:18.6705834Z 'range_multi_buffers': [None, False], 2026-02-21T10:46:18.6706043Z 'range_num_stages': [0, 3], 2026-02-21T10:46:18.6706225Z 'range_unroll_factors': [0, 4], 2026-02-21T10:46:18.6706415Z 'range_warp_specializes': [], 2026-02-21T10:46:18.6706604Z 'waves_per_eu': 1} 2026-02-21T10:46:18.7751267Z [104s] Fitting surrogate: 379 points, 379 targets 2026-02-21T10:46:19.8463582Z [105s] Generation 5 starting: 52 neighbors, 4 active search path(s) 2026-02-21T10:46:29.2456626Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 2.6 configs/s 2026-02-21T10:46:32.6030047Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.2 configs/s 2026-02-21T10:46:34.7153173Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 467.0 2026-02-21T10:46:34.7153756Z configs/s 2026-02-21T10:46:35.1105013Z [120s] Generation 5 complete: 2026-02-21T10:46:35.1105400Z ok=56 2026-02-21T10:46:35.1105627Z min=0.0198 2026-02-21T10:46:35.1105843Z mid=0.0232 2026-02-21T10:46:35.1106052Z max=0.0706 2026-02-21T10:46:35.1106289Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:46:35.1106712Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:46:35.1107147Z 'l2_groupings': [1], 2026-02-21T10:46:35.1107430Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:46:35.1108128Z 'loop_orders': [[0, 1]], 2026-02-21T10:46:35.1108411Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:46:35.1108683Z 'num_stages': 3, 2026-02-21T10:46:35.1108918Z 'num_warps': 4, 2026-02-21T10:46:35.1109176Z 'pid_type': 'flat', 2026-02-21T10:46:35.1109441Z 'range_flattens': [None, 
True], 2026-02-21T10:46:35.1109758Z 'range_multi_buffers': [None, None], 2026-02-21T10:46:35.1110068Z 'range_num_stages': [0, 3], 2026-02-21T10:46:35.1110342Z 'range_unroll_factors': [0, 2], 2026-02-21T10:46:35.1110640Z 'range_warp_specializes': [], 2026-02-21T10:46:35.1110916Z 'waves_per_eu': 2} 2026-02-21T10:46:35.1702295Z [120s] Fitting surrogate: 435 points, 435 targets 2026-02-21T10:46:36.2904778Z [121s] Generation 6 starting: 55 neighbors, 4 active search path(s) 2026-02-21T10:46:44.4055897Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 9.2 configs/s 2026-02-21T10:46:48.0412741Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 16.0 configs/s 2026-02-21T10:46:50.3449868Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 429.6 2026-02-21T10:46:50.3450128Z configs/s 2026-02-21T10:46:50.7618453Z [136s] Generation 6 complete: 2026-02-21T10:46:50.7618778Z ok=59 2026-02-21T10:46:50.7618964Z min=0.0188 2026-02-21T10:46:50.7619138Z mid=0.0214 2026-02-21T10:46:50.7619311Z max=0.3548 2026-02-21T10:46:50.7619498Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:46:50.7619851Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:46:50.7620178Z 'l2_groupings': [1], 2026-02-21T10:46:50.7620413Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:46:50.7620674Z 'loop_orders': [[0, 1]], 2026-02-21T10:46:50.7620905Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:46:50.7621132Z 'num_stages': 3, 2026-02-21T10:46:50.7621325Z 'num_warps': 4, 2026-02-21T10:46:50.7621680Z 'pid_type': 'flat', 2026-02-21T10:46:50.7621900Z 'range_flattens': [None, True], 2026-02-21T10:46:50.7622180Z 'range_multi_buffers': [None, None], 2026-02-21T10:46:50.7622442Z 'range_num_stages': [0, 3], 2026-02-21T10:46:50.7622673Z 'range_unroll_factors': [0, 2], 2026-02-21T10:46:50.7622919Z 'range_warp_specializes': [], 2026-02-21T10:46:50.7623154Z 'waves_per_eu': 3} 2026-02-21T10:46:50.8339878Z [136s] Fitting surrogate: 494 points, 494 
targets 2026-02-21T10:46:51.9942435Z [137s] Generation 7 starting: 59 neighbors, 4 active search path(s) 2026-02-21T10:47:01.5057177Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 6.8 configs/s 2026-02-21T10:47:05.3374455Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.0 configs/s 2026-02-21T10:47:07.4740702Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 463.7 2026-02-21T10:47:07.4741304Z configs/s 2026-02-21T10:47:07.9056876Z [153s] Generation 7 complete: 2026-02-21T10:47:07.9057316Z ok=63 2026-02-21T10:47:07.9057530Z min=0.0186 2026-02-21T10:47:07.9057781Z mid=0.0247 2026-02-21T10:47:07.9057982Z max=0.1292 2026-02-21T10:47:07.9058213Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:47:07.9058623Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:07.9059043Z 'l2_groupings': [1], 2026-02-21T10:47:07.9059326Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:07.9059645Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:07.9059926Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:07.9060199Z 'num_stages': 3, 2026-02-21T10:47:07.9060910Z 'num_warps': 4, 2026-02-21T10:47:07.9061118Z 'pid_type': 'flat', 2026-02-21T10:47:07.9061345Z 'range_flattens': [None, True], 2026-02-21T10:47:07.9061610Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:07.9061884Z 'range_num_stages': [0, 2], 2026-02-21T10:47:07.9062120Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:07.9062391Z 'range_warp_specializes': [], 2026-02-21T10:47:07.9062629Z 'waves_per_eu': 3} 2026-02-21T10:47:07.9703495Z [153s] Fitting surrogate: 557 points, 557 targets 2026-02-21T10:47:08.9901353Z [154s] Generation 8 starting: 45 neighbors, 3 active search path(s) 2026-02-21T10:47:16.6652594Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 4.5 configs/s 2026-02-21T10:47:19.6064007Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 16.1 configs/s 2026-02-21T10:47:21.3626125Z Generation 8: 
verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 559.3 2026-02-21T10:47:21.3626777Z configs/s 2026-02-21T10:47:21.7391761Z [167s] Generation 8 complete: 2026-02-21T10:47:21.7392214Z ok=48 2026-02-21T10:47:21.7392435Z min=0.0188 2026-02-21T10:47:21.7392654Z mid=0.0218 2026-02-21T10:47:21.7392858Z max=0.1299 2026-02-21T10:47:21.7393094Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:47:21.7393542Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:21.7393941Z 'l2_groupings': [1], 2026-02-21T10:47:21.7394221Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:21.7394571Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:21.7394848Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:21.7395125Z 'num_stages': 3, 2026-02-21T10:47:21.7395356Z 'num_warps': 4, 2026-02-21T10:47:21.7396026Z 'pid_type': 'flat', 2026-02-21T10:47:21.7396294Z 'range_flattens': [None, True], 2026-02-21T10:47:21.7396593Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:21.7396903Z 'range_num_stages': [0, 2], 2026-02-21T10:47:21.7397180Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:21.7397491Z 'range_warp_specializes': [], 2026-02-21T10:47:21.7397771Z 'waves_per_eu': 3} 2026-02-21T10:47:21.7968366Z [167s] Fitting surrogate: 605 points, 605 targets 2026-02-21T10:47:22.1455859Z [167s] Generation 9 starting: 22 neighbors, 2 active search path(s) 2026-02-21T10:47:26.9678138Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 4.7 configs/s 2026-02-21T10:47:28.4536933Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 16.4 configs/s 2026-02-21T10:47:29.1965431Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1296.7 2026-02-21T10:47:29.1965918Z configs/s 2026-02-21T10:47:29.4830539Z [175s] Generation 9 complete: 2026-02-21T10:47:29.4830839Z ok=24 2026-02-21T10:47:29.4830947Z min=0.0186 2026-02-21T10:47:29.4831062Z mid=0.0244 2026-02-21T10:47:29.4831167Z max=0.1235 2026-02-21T10:47:29.4831290Z best={'block_sizes': [1, 64, 64], 
2026-02-21T10:47:29.4831502Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:29.4831708Z 'l2_groupings': [1], 2026-02-21T10:47:29.4831849Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:29.4832003Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:29.4832138Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:29.4832273Z 'num_stages': 3, 2026-02-21T10:47:29.4832877Z 'num_warps': 4, 2026-02-21T10:47:29.4832997Z 'pid_type': 'flat', 2026-02-21T10:47:29.4833319Z 'range_flattens': [None, True], 2026-02-21T10:47:29.4833862Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:29.4833996Z 'range_num_stages': [0, 2], 2026-02-21T10:47:29.4834110Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:29.4834235Z 'range_warp_specializes': [], 2026-02-21T10:47:29.4834350Z 'waves_per_eu': 3} 2026-02-21T10:47:29.5000597Z [175s] Fitting surrogate: 629 points, 629 targets 2026-02-21T10:47:29.7568746Z [175s] Generation 10 starting: 16 neighbors, 1 active search path(s) 2026-02-21T10:47:33.0168509Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.4 configs/s 2026-02-21T10:47:34.1295632Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.5 configs/s 2026-02-21T10:47:35.1024484Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 992.7 2026-02-21T10:47:35.1025107Z configs/s 2026-02-21T10:47:35.4210674Z [181s] Generation 10 complete: 2026-02-21T10:47:35.4211002Z ok=18 2026-02-21T10:47:35.4211247Z min=0.0202 2026-02-21T10:47:35.4211496Z mid=0.0224 2026-02-21T10:47:35.4211700Z max=0.0303 2026-02-21T10:47:35.4214132Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:47:35.4215008Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:35.4215416Z 'l2_groupings': [1], 2026-02-21T10:47:35.4215701Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:35.4216024Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:35.4216305Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:35.4216581Z 'num_stages': 3, 
2026-02-21T10:47:35.4216816Z 'num_warps': 4, 2026-02-21T10:47:35.4217043Z 'pid_type': 'flat', 2026-02-21T10:47:35.4217307Z 'range_flattens': [None, True], 2026-02-21T10:47:35.4217605Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:35.4217910Z 'range_num_stages': [0, 2], 2026-02-21T10:47:35.4218184Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:35.4218622Z 'range_warp_specializes': [], 2026-02-21T10:47:35.4218826Z 'waves_per_eu': 3} 2026-02-21T10:47:35.4462042Z [181s] Fitting surrogate: 647 points, 647 targets 2026-02-21T10:47:35.7119063Z [181s] Generation 11 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:47:38.4365293Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 6.4 configs/s 2026-02-21T10:47:39.5394358Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.7 configs/s 2026-02-21T10:47:40.0720682Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1810.2 2026-02-21T10:47:40.0721247Z configs/s 2026-02-21T10:47:40.3696337Z [185s] Generation 11 complete: 2026-02-21T10:47:40.3696661Z ok=17 2026-02-21T10:47:40.3696868Z min=0.0196 2026-02-21T10:47:40.3697080Z mid=0.0262 2026-02-21T10:47:40.3697279Z max=0.0739 2026-02-21T10:47:40.3697508Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:47:40.3697933Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:40.3698330Z 'l2_groupings': [1], 2026-02-21T10:47:40.3698625Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:40.3698947Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:40.3699229Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:40.3699495Z 'num_stages': 3, 2026-02-21T10:47:40.3699736Z 'num_warps': 4, 2026-02-21T10:47:40.3699965Z 'pid_type': 'flat', 2026-02-21T10:47:40.3700228Z 'range_flattens': [None, True], 2026-02-21T10:47:40.3700526Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:40.3700833Z 'range_num_stages': [0, 2], 2026-02-21T10:47:40.3701112Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:40.3701408Z 
'range_warp_specializes': [], 2026-02-21T10:47:40.3701683Z 'waves_per_eu': 3} 2026-02-21T10:47:40.3871707Z [185s] Fitting surrogate: 664 points, 664 targets 2026-02-21T10:47:40.6698643Z [186s] Generation 12 starting: 16 neighbors, 1 active search path(s) 2026-02-21T10:47:45.4007810Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 1.8 configs/s 2026-02-21T10:47:46.5325066Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s 2026-02-21T10:47:46.7680589Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4380.2 2026-02-21T10:47:46.7681129Z configs/s 2026-02-21T10:47:46.9751484Z [192s] Generation 12 complete: 2026-02-21T10:47:46.9751803Z ok=18 2026-02-21T10:47:46.9752008Z min=0.0184 2026-02-21T10:47:46.9752221Z mid=0.0302 2026-02-21T10:47:46.9752419Z max=0.3199 2026-02-21T10:47:46.9752924Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:47:46.9753325Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:46.9753721Z 'l2_groupings': [1], 2026-02-21T10:47:46.9753993Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:46.9754314Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:46.9754597Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:46.9754865Z 'num_stages': 3, 2026-02-21T10:47:46.9755109Z 'num_warps': 4, 2026-02-21T10:47:46.9755336Z 'pid_type': 'flat', 2026-02-21T10:47:46.9755607Z 'range_flattens': [None, True], 2026-02-21T10:47:46.9755905Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:46.9756216Z 'range_num_stages': [0, 2], 2026-02-21T10:47:46.9756569Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:46.9756822Z 'range_warp_specializes': [], 2026-02-21T10:47:46.9757060Z 'waves_per_eu': 3} 2026-02-21T10:47:46.9836098Z [192s] Fitting surrogate: 682 points, 682 targets 2026-02-21T10:47:47.2557375Z [192s] Generation 13 starting: 16 neighbors, 1 active search path(s) 2026-02-21T10:47:50.7445678Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.8 configs/s 
2026-02-21T10:47:51.8598702Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.5 configs/s 2026-02-21T10:47:52.6644778Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1196.2 2026-02-21T10:47:52.6645563Z configs/s 2026-02-21T10:47:52.9635338Z [198s] Generation 13 complete: 2026-02-21T10:47:52.9635682Z ok=18 2026-02-21T10:47:52.9635892Z min=0.0187 2026-02-21T10:47:52.9636045Z mid=0.0208 2026-02-21T10:47:52.9636251Z max=0.2300 2026-02-21T10:47:52.9636484Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:47:52.9636905Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:52.9637295Z 'l2_groupings': [1], 2026-02-21T10:47:52.9637577Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:52.9637891Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:52.9638181Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:52.9638450Z 'num_stages': 3, 2026-02-21T10:47:52.9638697Z 'num_warps': 4, 2026-02-21T10:47:52.9638923Z 'pid_type': 'flat', 2026-02-21T10:47:52.9639185Z 'range_flattens': [None, True], 2026-02-21T10:47:52.9639494Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:52.9639798Z 'range_num_stages': [0, 2], 2026-02-21T10:47:52.9640081Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:52.9640372Z 'range_warp_specializes': [], 2026-02-21T10:47:52.9640654Z 'waves_per_eu': 3} 2026-02-21T10:47:52.9832845Z [198s] Fitting surrogate: 700 points, 700 targets 2026-02-21T10:47:53.2558942Z [198s] Generation 14 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:47:55.8819354Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 9.3 configs/s 2026-02-21T10:47:56.9155065Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 16.9 configs/s 2026-02-21T10:47:57.5754939Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1454.5 2026-02-21T10:47:57.5756102Z configs/s 2026-02-21T10:47:57.8575869Z [203s] Generation 14 complete: 2026-02-21T10:47:57.8576206Z ok=17 
2026-02-21T10:47:57.8576416Z min=0.0187 2026-02-21T10:47:57.8576636Z mid=0.0197 2026-02-21T10:47:57.8576837Z max=0.0698 2026-02-21T10:47:57.8577088Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:47:57.8577494Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:47:57.8578104Z 'l2_groupings': [1], 2026-02-21T10:47:57.8578387Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:47:57.8578702Z 'loop_orders': [[0, 1]], 2026-02-21T10:47:57.8578981Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:47:57.8579261Z 'num_stages': 3, 2026-02-21T10:47:57.8579496Z 'num_warps': 4, 2026-02-21T10:47:57.8579724Z 'pid_type': 'flat', 2026-02-21T10:47:57.8579985Z 'range_flattens': [None, True], 2026-02-21T10:47:57.8580282Z 'range_multi_buffers': [None, None], 2026-02-21T10:47:57.8580598Z 'range_num_stages': [0, 2], 2026-02-21T10:47:57.8580868Z 'range_unroll_factors': [0, 2], 2026-02-21T10:47:57.8581168Z 'range_warp_specializes': [], 2026-02-21T10:47:57.8581447Z 'waves_per_eu': 3} 2026-02-21T10:47:57.8732974Z [203s] Fitting surrogate: 717 points, 717 targets 2026-02-21T10:47:58.1642398Z [203s] Generation 15 starting: 16 neighbors, 1 active search path(s) 2026-02-21T10:48:01.0895327Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 5.7 configs/s 2026-02-21T10:48:02.1855313Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.8 configs/s 2026-02-21T10:48:03.5215395Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1128.3 2026-02-21T10:48:03.5216305Z configs/s 2026-02-21T10:48:03.8196838Z [209s] Generation 15 complete: 2026-02-21T10:48:03.8197171Z ok=18 2026-02-21T10:48:03.8197380Z min=0.0187 2026-02-21T10:48:03.8197596Z mid=0.0201 2026-02-21T10:48:03.8197815Z max=0.0727 2026-02-21T10:48:03.8198058Z best={'block_sizes': [1, 64, 64], 2026-02-21T10:48:03.8198470Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:48:03.8198859Z 'l2_groupings': [1], 2026-02-21T10:48:03.8199149Z 
'load_eviction_policies': ['', '', ''], 2026-02-21T10:48:03.8199461Z 'loop_orders': [[0, 1]], 2026-02-21T10:48:03.8199984Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:48:03.8200256Z 'num_stages': 3, 2026-02-21T10:48:03.8200480Z 'num_warps': 4, 2026-02-21T10:48:03.8200730Z 'pid_type': 'flat', 2026-02-21T10:48:03.8200993Z 'range_flattens': [None, True], 2026-02-21T10:48:03.8201296Z 'range_multi_buffers': [None, None], 2026-02-21T10:48:03.8201603Z 'range_num_stages': [0, 2], 2026-02-21T10:48:03.8201888Z 'range_unroll_factors': [0, 2], 2026-02-21T10:48:03.8202175Z 'range_warp_specializes': [], 2026-02-21T10:48:03.8202462Z 'waves_per_eu': 3} 2026-02-21T10:48:03.8443995Z [209s] Fitting surrogate: 735 points, 735 targets 2026-02-21T10:48:03.9792831Z [209s] Autotuning complete in 209.6s after searching 694 configs. 2026-02-21T10:48:03.9793177Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:48:03.9794344Z @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T10:48:03.9795386Z 2026-02-21T10:48:03.9795665Z [209s] Code of selected kernel: /tmp/torchinductor_root/yz/cyz7bheexsaqjs64c2glkk7hmjd25c3nikx5d3o6ma22jc6ttott.py 2026-02-21T10:48:04.4578206Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T10:48:04.4579575Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T10:48:04.4581040Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T10:48:04.4581467Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T10:48:04.4581729Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T10:48:04.4581936Z ------------------------------------------ 2026-02-21T10:48:04.4582129Z (4, 48, 128, 128, 128) 2026-02-21T10:48:04.4582227Z 2026-02-21T10:48:04.4587317Z 17%|█▋ | 1/6 [03:38<18:10, 218.07s/it]WARNING:tritonbench.utils.triton_op:Running input ID 1: 2026-02-21T10:48:04.4587606Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T10:48:04.4587791Z ------------------------------------------ 2026-02-21T10:48:04.4587962Z (4, 48, 256, 256, 128) 2026-02-21T10:48:04.4589880Z INFO:tritonbench.utils.triton_op:Took 0.07ms to get benchmark function for aten 2026-02-21T10:48:05.7295342Z INFO:tritonbench.utils.triton_op:Took 2.42ms to get benchmark function for flex_attention 2026-02-21T10:48:07.1547255Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:48:07.1547535Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:48:07.1547739Z 'dtype': 'torch.bfloat16', 2026-02-21T10:48:07.1547892Z 'shape': (4, 48, 256, 128), 2026-02-21T10:48:07.1548047Z 'stride': (1572864, 32768, 128, 1)}, 2026-02-21T10:48:07.1548647Z { 'device': 'cuda:0', 2026-02-21T10:48:07.1548788Z 'dtype': 'torch.bfloat16', 2026-02-21T10:48:07.1548932Z 'shape': (4, 48, 256, 128), 
2026-02-21T10:48:07.1549078Z 'stride': (1572864, 32768, 128, 1)}, 2026-02-21T10:48:07.1549237Z { 'device': 'cuda:0', 2026-02-21T10:48:07.1549369Z 'dtype': 'torch.bfloat16', 2026-02-21T10:48:07.1549507Z 'shape': (4, 48, 256, 128), 2026-02-21T10:48:07.1549653Z 'stride': (1572864, 32768, 128, 1)}), 2026-02-21T10:48:07.1549795Z 'kwargs': {}} 2026-02-21T10:48:07.1569653Z INFO:tritonbench.utils.triton_op:Took 2.78ms to get benchmark function for helion_attention 2026-02-21T10:48:07.4046936Z [0s] Autotune random seed: 2144140282 2026-02-21T10:48:07.5643868Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T10:48:25.4723692Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 3.7 configs/s 2026-02-21T10:48:27.7923620Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:55:129: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T10:48:27.7936866Z k = tl.load(k_view + (indices_0[:, None, None] * 32768 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T10:48:27.7937291Z ^ 2026-02-21T10:48:27.7938077Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:57:141: note: - use: %132 = "tt.reshape"(<>) : (tensor<1x128x64xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x64xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>> 2026-02-21T10:48:27.7938757Z 2026-02-21T10:48:27.7939128Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T10:48:27.7939948Z ^ 2026-02-21T10:48:27.7940158Z LLVM ERROR: operation destroyed but 
still has uses 2026-02-21T10:48:27.7940422Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:48:27.7940781Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:48:27.7941231Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:48:27.7941576Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:48:27.7941919Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:48:27.7942236Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T10:48:27.7942539Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T10:48:27.7942843Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:48:27.7943156Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:48:27.7943472Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:48:27.7943850Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:48:27.7944180Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T10:48:27.7944701Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T10:48:27.7945097Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T10:48:27.7945335Z %c192_i64 = arith.constant 192 : i64 2026-02-21T10:48:27.7945454Z %c0_i64 = arith.constant 0 : i64 2026-02-21T10:48:27.7945574Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T10:48:27.7945740Z %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked> 2026-02-21T10:48:27.7945936Z %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked> 2026-02-21T10:48:27.7946120Z %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked> 2026-02-21T10:48:27.7946300Z %cst_2 = arith.constant dense<256> : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:48:27.7946485Z %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:48:27.7946657Z %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:48:27.7946809Z %c64_i32 = arith.constant 64 : i32 2026-02-21T10:48:27.7946927Z %c256_i32 = arith.constant 256 : i32 2026-02-21T10:48:27.7947045Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T10:48:27.7947192Z %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1> 2026-02-21T10:48:27.7947389Z %cst_6 = arith.constant dense<128> : tensor<1x64x1xi32, #blocked2> 2026-02-21T10:48:27.7947579Z %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x64xf32, #blocked> 2026-02-21T10:48:27.7947778Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7947973Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x64xf32, #blocked4> 2026-02-21T10:48:27.7948160Z %cst_10 = arith.constant dense<128> : tensor<1x1x64xi32, #blocked> 2026-02-21T10:48:27.7948331Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:48:27.7948486Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked> 2026-02-21T10:48:27.7948687Z %cst_12 = arith.constant dense<1.000000e+00> : 
tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7948880Z %cst_13 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7949038Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:48:27.7949153Z %c16_i32 = arith.constant 16 : i32 2026-02-21T10:48:27.7949266Z %c128_i32 = arith.constant 128 : i32 2026-02-21T10:48:27.7949406Z %0 = tt.get_program_id x : i32 2026-02-21T10:48:27.7949518Z %1 = arith.divsi %0, %c3072_i32 : i32 2026-02-21T10:48:27.7949635Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T10:48:27.7949746Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T10:48:27.7949861Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T10:48:27.7949974Z %5 = arith.remsi %0, %c3072_i32 : i32 2026-02-21T10:48:27.7950089Z %6 = arith.remsi %5, %4 : i32 2026-02-21T10:48:27.7950201Z %7 = arith.addi %2, %6 : i32 2026-02-21T10:48:27.7950306Z %8 = arith.divsi %5, %4 : i32 2026-02-21T10:48:27.7950417Z %9 = arith.muli %7, %c2_i32 : i32 2026-02-21T10:48:27.7950571Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5> 2026-02-21T10:48:27.7950753Z %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5> 2026-02-21T10:48:27.7950903Z %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5> 2026-02-21T10:48:27.7951078Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5> 2026-02-21T10:48:27.7951248Z %14 = arith.extsi %8 : i32 to i64 2026-02-21T10:48:27.7951357Z %15 = arith.extsi %9 : i32 to i64 2026-02-21T10:48:27.7951532Z %16 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:48:27.7951698Z %17 = arith.muli %14, %c32768_i64 : i64 2026-02-21T10:48:27.7951838Z %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked> 2026-02-21T10:48:27.7951990Z %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5> 2026-02-21T10:48:27.7952163Z %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5> 2026-02-21T10:48:27.7952341Z %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5> 
2026-02-21T10:48:27.7952585Z %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:48:27.7952908Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6> 2026-02-21T10:48:27.7953187Z %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3> 2026-02-21T10:48:27.7953468Z %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:48:27.7953802Z %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7> 2026-02-21T10:48:27.7954097Z %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1> 2026-02-21T10:48:27.7954306Z %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:48:27.7954501Z %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1> 2026-02-21T10:48:27.7954745Z %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked> 2026-02-21T10:48:27.7954976Z %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5> 2026-02-21T10:48:27.7955243Z %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:48:27.7955568Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6> 2026-02-21T10:48:27.7955852Z %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4> 2026-02-21T10:48:27.7956150Z %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T10:48:27.7956490Z %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, 
parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8> 2026-02-21T10:48:27.7956787Z %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked> 2026-02-21T10:48:27.7957028Z %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked> 2026-02-21T10:48:27.7957238Z %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked> 2026-02-21T10:48:27.7957395Z %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked> 2026-02-21T10:48:27.7957598Z %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi64, #blocked> 2026-02-21T10:48:27.7957786Z %42 = arith.cmpi sge, %14, %c0_i64 : i64 2026-02-21T10:48:27.7957915Z %43 = arith.cmpi slt, %14, %c192_i64 : i64 2026-02-21T10:48:27.7958036Z %44 = arith.andi %42, %43 : i1 2026-02-21T10:48:27.7958177Z %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:48:27.7958347Z %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:48:27.7958512Z %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1> 2026-02-21T10:48:27.7958666Z %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1> 2026-02-21T10:48:27.7958815Z %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1> 2026-02-21T10:48:27.7959005Z %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1> 2026-02-21T10:48:27.7959257Z %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked> 2026-02-21T10:48:27.7959468Z %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked> 2026-02-21T10:48:27.7959638Z %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked> 2026-02-21T10:48:27.7959801Z %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked> 2026-02-21T10:48:27.7959987Z %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T10:48:27.7960180Z %56 = arith.andi %51, %55 : tensor<1x2x128xi1, #blocked> 2026-02-21T10:48:27.7960343Z %57 = tt.load %41, 
%56, %cst : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:48:27.7960554Z %58 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked5> 2026-02-21T10:48:27.7960723Z %59 = arith.muli %8, %c32768_i32 : i32 2026-02-21T10:48:27.7960943Z %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:48:27.7961270Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T10:48:27.7961558Z %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4> 2026-02-21T10:48:27.7961840Z %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T10:48:27.7962177Z %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9> 2026-02-21T10:48:27.7962480Z %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2> 2026-02-21T10:48:27.7962752Z %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2> 2026-02-21T10:48:27.7962915Z %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2> 2026-02-21T10:48:27.7963115Z %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x64xi32, #blocked2> 2026-02-21T10:48:27.7963365Z %69 = ttg.convert_layout %68 : tensor<1x128x64xi32, #blocked2> -> tensor<1x128x64xi32, #blocked> 2026-02-21T10:48:27.7963598Z %70 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x64x!tt.ptr, #blocked> 2026-02-21T10:48:27.7963832Z %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4> 2026-02-21T10:48:27.7964026Z %72 = tt.splat %59 : i32 -> tensor<1x64x1xi32, #blocked2> 2026-02-21T10:48:27.7964265Z %73 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 
2026-02-21T10:48:27.7964602Z %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8> 2026-02-21T10:48:27.7964920Z %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T10:48:27.7965158Z %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x64x128xi32, #blocked> 2026-02-21T10:48:27.7965376Z %77 = tt.splat %arg2 : !tt.ptr -> tensor<1x64x128x!tt.ptr, #blocked> 2026-02-21T10:48:27.7965757Z %78:3 = scf.for %arg4 = %c0_i32 to %c256_i32 step %c64_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>) : i32 { 2026-02-21T10:48:27.7966115Z %108 = tt.splat %arg4 : i32 -> tensor<64xi32, #blocked5> 2026-02-21T10:48:27.7966273Z %109 = arith.addi %108, %58 : tensor<64xi32, #blocked5> 2026-02-21T10:48:27.7966510Z %110 = ttg.convert_layout %109 : tensor<64xi32, #blocked5> -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:48:27.7966840Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x64xi32, #blocked6> 2026-02-21T10:48:27.7967132Z %112 = ttg.convert_layout %111 : tensor<1x64xi32, #blocked6> -> tensor<1x64xi32, #blocked4> 2026-02-21T10:48:27.7967447Z %113 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T10:48:27.7967791Z %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x64xi32, #blocked8> 2026-02-21T10:48:27.7968090Z %115 = ttg.convert_layout %114 : tensor<1x1x64xi32, #blocked8> -> tensor<1x1x64xi32, #blocked> 2026-02-21T10:48:27.7968305Z %116 = arith.muli %115, %cst_10 : tensor<1x1x64xi32, #blocked> 2026-02-21T10:48:27.7968505Z %117 = tt.broadcast %116 : tensor<1x1x64xi32, #blocked> -> 
tensor<1x128x64xi32, #blocked> 2026-02-21T10:48:27.7968729Z %118 = arith.addi %69, %117 : tensor<1x128x64xi32, #blocked> 2026-02-21T10:48:27.7968944Z %119 = tt.addptr %70, %118 : tensor<1x128x64x!tt.ptr, #blocked>, tensor<1x128x64xi32, #blocked> 2026-02-21T10:48:27.7969161Z %120 = tt.load %119 : tensor<1x128x64x!tt.ptr, #blocked> 2026-02-21T10:48:27.7969366Z %121 = tt.reshape %120 : tensor<1x128x64xbf16, #blocked> -> tensor<128x64xbf16, #blocked4> 2026-02-21T10:48:27.7969663Z %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>> 2026-02-21T10:48:27.7970020Z %123 = ttg.convert_layout %121 : tensor<128x64xbf16, #blocked4> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>> 2026-02-21T10:48:27.7970324Z %124 = ttg.convert_layout %cst_9 : tensor<2x64xf32, #blocked4> -> tensor<2x64xf32, #blocked4> 2026-02-21T10:48:27.7970735Z %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>> * tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>> -> tensor<2x64xf32, #blocked4> 2026-02-21T10:48:27.7971126Z %126 = tt.reshape %125 : tensor<2x64xf32, #blocked4> -> tensor<1x2x64xf32, #blocked> 2026-02-21T10:48:27.7971358Z %127 = arith.truncf %126 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked> 2026-02-21T10:48:27.7971591Z %128 = arith.extf %127 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked> 2026-02-21T10:48:27.7971781Z %129 = "tt.reduce"(%128) <{axis = 2 : i32}> ({ 2026-02-21T10:48:27.7971922Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:48:27.7972046Z %182 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T10:48:27.7972169Z tt.reduce.return %182 : f32 2026-02-21T10:48:27.7972358Z }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T10:48:27.7972649Z %130 = ttg.convert_layout %129 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 
-> tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7972919Z %131 = arith.truncf %130 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 2026-02-21T10:48:27.7973161Z %132 = arith.extf %131 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7973354Z %133 = arith.mulf %132, %cst_8 : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7973545Z %134 = arith.truncf %133 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 2026-02-21T10:48:27.7973759Z %135 = arith.extf %134 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7973957Z %136 = arith.cmpf ogt, %arg5, %135 : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7974134Z %137 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7974296Z %138 = arith.ori %136, %137 : tensor<1x2xi1, #blocked3> 2026-02-21T10:48:27.7974494Z %139 = arith.select %138, %arg5, %135 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7974699Z %140 = arith.mulf %128, %cst_7 : tensor<1x2x64xf32, #blocked> 2026-02-21T10:48:27.7974901Z %141 = arith.truncf %140 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked> 2026-02-21T10:48:27.7975197Z %142 = ttg.convert_layout %139 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:48:27.7975530Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T10:48:27.7975827Z %144 = ttg.convert_layout %143 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T10:48:27.7976066Z %145 = arith.extf %141 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked> 2026-02-21T10:48:27.7976300Z %146 = tt.broadcast %144 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x64xf32, #blocked1> 2026-02-21T10:48:27.7976559Z %147 = ttg.convert_layout %146 : tensor<1x2x64xf32, #blocked1> -> tensor<1x2x64xf32, #blocked> 
2026-02-21T10:48:27.7976767Z %148 = arith.subf %145, %147 : tensor<1x2x64xf32, #blocked> 2026-02-21T10:48:27.7977070Z %149 = tt.extern_elementwise %148 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2x64xf32, #blocked> 2026-02-21T10:48:27.7977354Z %150 = "tt.reduce"(%149) <{axis = 2 : i32}> ({ 2026-02-21T10:48:27.7977482Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:48:27.7977599Z %182 = arith.addf %arg8, %arg9 : f32 2026-02-21T10:48:27.7977718Z tt.reduce.return %182 : f32 2026-02-21T10:48:27.7977902Z }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T10:48:27.7978187Z %151 = ttg.convert_layout %150 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7978430Z %152 = arith.subf %arg5, %139 : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7978716Z %153 = tt.extern_elementwise %152 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7979006Z %154 = arith.mulf %arg6, %153 : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7979168Z %155 = arith.addf %154, %151 : tensor<1x2xf32, #blocked3> 2026-02-21T10:48:27.7979407Z %156 = ttg.convert_layout %153 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:48:27.7979744Z %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T10:48:27.7980071Z %158 = ttg.convert_layout %157 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T10:48:27.7980314Z %159 = tt.broadcast %158 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T10:48:27.7980562Z %160 = ttg.convert_layout %159 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T10:48:27.7980776Z %161 = 
arith.mulf %arg7, %160 : tensor<1x2x128xf32, #blocked> 2026-02-21T10:48:27.7981039Z %162 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T10:48:27.7981379Z %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x64x1xi32, #blocked9> 2026-02-21T10:48:27.7981679Z %164 = ttg.convert_layout %163 : tensor<1x64x1xi32, #blocked9> -> tensor<1x64x1xi32, #blocked2> 2026-02-21T10:48:27.7981893Z %165 = arith.muli %164, %cst_6 : tensor<1x64x1xi32, #blocked2> 2026-02-21T10:48:27.7982061Z %166 = arith.addi %72, %165 : tensor<1x64x1xi32, #blocked2> 2026-02-21T10:48:27.7982263Z %167 = tt.broadcast %166 : tensor<1x64x1xi32, #blocked2> -> tensor<1x64x128xi32, #blocked2> 2026-02-21T10:48:27.7982520Z %168 = ttg.convert_layout %167 : tensor<1x64x128xi32, #blocked2> -> tensor<1x64x128xi32, #blocked> 2026-02-21T10:48:27.7982732Z %169 = arith.addi %168, %76 : tensor<1x64x128xi32, #blocked> 2026-02-21T10:48:27.7982948Z %170 = tt.addptr %77, %169 : tensor<1x64x128x!tt.ptr, #blocked>, tensor<1x64x128xi32, #blocked> 2026-02-21T10:48:27.7983178Z %171 = tt.load %170 : tensor<1x64x128x!tt.ptr, #blocked> 2026-02-21T10:48:27.7983377Z %172 = arith.truncf %149 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked> 2026-02-21T10:48:27.7983610Z %173 = tt.reshape %161 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4> 2026-02-21T10:48:27.7983835Z %174 = tt.reshape %172 : tensor<1x2x64xbf16, #blocked> -> tensor<2x64xbf16, #blocked4> 2026-02-21T10:48:27.7984066Z %175 = tt.reshape %171 : tensor<1x64x128xbf16, #blocked> -> tensor<64x128xbf16, #blocked4> 2026-02-21T10:48:27.7984379Z %176 = ttg.convert_layout %174 : tensor<2x64xbf16, #blocked4> -> tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> 2026-02-21T10:48:27.7984734Z %177 = ttg.convert_layout %175 : tensor<64x128xbf16, #blocked4> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 
1, parent = #blocked10}>> 2026-02-21T10:48:27.7985044Z %178 = ttg.convert_layout %173 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked10> 2026-02-21T10:48:27.7985454Z %179 = tt.dot %176, %177, %178, inputPrecision = tf32 : tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10> 2026-02-21T10:48:27.7985861Z %180 = ttg.convert_layout %179 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked4> 2026-02-21T10:48:27.7986100Z %181 = tt.reshape %180 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked> 2026-02-21T10:48:27.7986361Z scf.yield %139, %155, %181 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked> 2026-02-21T10:48:27.7986615Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T10:48:27.7986874Z %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:48:27.7987203Z %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T10:48:27.7987492Z %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T10:48:27.7987725Z %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T10:48:27.7987979Z %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T10:48:27.7988183Z %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked> 2026-02-21T10:48:27.7988377Z %85 = arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked> 2026-02-21T10:48:27.7988560Z %86 = arith.muli %8, %c32768_i32 : i32 2026-02-21T10:48:27.7988771Z %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 
2026-02-21T10:48:27.7989105Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6> 2026-02-21T10:48:27.7989379Z %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3> 2026-02-21T10:48:27.7989653Z %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:48:27.7989977Z %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7> 2026-02-21T10:48:27.7990261Z %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T10:48:27.7990478Z %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1> 2026-02-21T10:48:27.7990637Z %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1> 2026-02-21T10:48:27.7990788Z %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1> 2026-02-21T10:48:27.7991026Z %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:48:27.8001508Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T10:48:27.8001810Z %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4> 2026-02-21T10:48:27.8002100Z %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T10:48:27.8002436Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8> 2026-02-21T10:48:27.8002801Z %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T10:48:27.8003042Z %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1> 
2026-02-21T10:48:27.8003285Z %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked> 2026-02-21T10:48:27.8003530Z %104 = tt.broadcast %101 : tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked> 2026-02-21T10:48:27.8003729Z %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked> 2026-02-21T10:48:27.8003915Z %106 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:48:27.8004148Z %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi32, #blocked> 2026-02-21T10:48:27.8004363Z tt.store %107, %85 : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:48:27.8004499Z tt.return 2026-02-21T10:48:27.8004580Z } 2026-02-21T10:48:27.8004659Z } 2026-02-21T10:48:27.8004701Z 2026-02-21T10:48:27.8004735Z {-# 2026-02-21T10:48:27.8004817Z external_resources: { 2026-02-21T10:48:27.8004916Z mlir_reproducer: { 2026-02-21T10:48:27.8007164Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, 
tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T10:48:27.8009498Z disable_threading: false, 2026-02-21T10:48:27.8009607Z verify_each: true 2026-02-21T10:48:27.8009697Z } 2026-02-21T10:48:27.8009772Z } 2026-02-21T10:48:27.8009840Z #-} 2026-02-21T10:48:27.8010121Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:48:27.8010844Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:48:27.8011393Z [20s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T10:48:27.8012158Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T10:48:27.8012823Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:48:27.8012991Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T10:48:32.8813311Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.7 configs/s 2026-02-21T10:48:32.8820561Z [25s] Adaptive compile timeout: 30s (90% percentile=11.8s, bounds=[30.0s, 30s]) 2026-02-21T10:48:33.4971720Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1335.4 configs/s 2026-02-21T10:48:33.9296650Z [26s] Initial random population of 100, 5 starting points: 2026-02-21T10:48:33.9298007Z error=12 2026-02-21T10:48:33.9298363Z ok=88 2026-02-21T10:48:33.9298570Z min=0.0564 2026-02-21T10:48:33.9298779Z mid=0.3850 2026-02-21T10:48:33.9298986Z max=42.0966 2026-02-21T10:48:33.9299232Z best={'block_sizes': [1, 32, 16], 2026-02-21T10:48:33.9299729Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:48:33.9300141Z 'l2_groupings': [8], 2026-02-21T10:48:33.9300425Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:48:33.9300692Z 'loop_orders': [[0, 1]], 2026-02-21T10:48:33.9300864Z 'matrix_instr_nonkdim': 0, 2026-02-21T10:48:33.9301029Z 'num_stages': 2, 2026-02-21T10:48:33.9301173Z 'num_warps': 1, 2026-02-21T10:48:33.9301322Z 'pid_type': 'flat', 2026-02-21T10:48:33.9301487Z 'range_flattens': [None, None], 2026-02-21T10:48:33.9301696Z 'range_multi_buffers': [None, True], 2026-02-21T10:48:33.9301878Z 'range_num_stages': [0, 3], 2026-02-21T10:48:33.9302050Z 'range_unroll_factors': [0, 2], 2026-02-21T10:48:33.9302225Z 'range_warp_specializes': [], 2026-02-21T10:48:33.9302395Z 'waves_per_eu': 2} 2026-02-21T10:48:33.9387998Z [26s] Fitting surrogate: 100 points, 100 targets 2026-02-21T10:48:34.7814813Z [27s] Generation 1 starting: 76 neighbors, 5 active search path(s) 2026-02-21T10:48:46.5151894Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 10.2 configs/s 2026-02-21T10:48:51.9384620Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 14.9 configs/s 2026-02-21T10:48:55.6808394Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 258.3 
2026-02-21T10:48:55.6808848Z configs/s 2026-02-21T10:48:56.3403130Z [48s] Generation 1 complete: 2026-02-21T10:48:56.3403345Z ok=81 2026-02-21T10:48:56.3403429Z min=0.0473 2026-02-21T10:48:56.3403913Z mid=0.0755 2026-02-21T10:48:56.3403987Z max=0.4422 2026-02-21T10:48:56.3404077Z best={'block_sizes': [1, 64, 16], 2026-02-21T10:48:56.3404345Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:48:56.3404500Z 'l2_groupings': [8], 2026-02-21T10:48:56.3404606Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:48:56.3404721Z 'loop_orders': [[0, 1]], 2026-02-21T10:48:56.3404824Z 'matrix_instr_nonkdim': 0, 2026-02-21T10:48:56.3404931Z 'num_stages': 2, 2026-02-21T10:48:56.3405020Z 'num_warps': 4, 2026-02-21T10:48:56.3405104Z 'pid_type': 'flat', 2026-02-21T10:48:56.3405201Z 'range_flattens': [None, None], 2026-02-21T10:48:56.3405311Z 'range_multi_buffers': [None, True], 2026-02-21T10:48:56.3405424Z 'range_num_stages': [0, 3], 2026-02-21T10:48:56.3405525Z 'range_unroll_factors': [0, 2], 2026-02-21T10:48:56.3405636Z 'range_warp_specializes': [], 2026-02-21T10:48:56.3405741Z 'waves_per_eu': 2} 2026-02-21T10:48:56.3435537Z [48s] Fitting surrogate: 181 points, 181 targets 2026-02-21T10:48:57.1274346Z [49s] Generation 2 starting: 70 neighbors, 5 active search path(s) 2026-02-21T10:49:37.1954001Z [89s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=1, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T10:49:37.6513434Z [90s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], 
matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T10:49:37.6531652Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.5 configs/s 2026-02-21T10:49:41.9222804Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 17.0 configs/s 2026-02-21T10:49:44.3870490Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 384.7 2026-02-21T10:49:44.3870938Z configs/s 2026-02-21T10:49:44.9136845Z [97s] Generation 2 complete: 2026-02-21T10:49:44.9137064Z timeout=2 2026-02-21T10:49:44.9137186Z ok=73 2026-02-21T10:49:44.9137310Z min=0.0454 2026-02-21T10:49:44.9137435Z mid=0.0787 2026-02-21T10:49:44.9137558Z max=4.1786 2026-02-21T10:49:44.9137701Z best={'block_sizes': [1, 64, 16], 2026-02-21T10:49:44.9137961Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:49:44.9138230Z 'l2_groupings': [4], 2026-02-21T10:49:44.9138404Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:49:44.9138598Z 'loop_orders': [[0, 1]], 2026-02-21T10:49:44.9139102Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:49:44.9139271Z 'num_stages': 2, 2026-02-21T10:49:44.9139413Z 'num_warps': 4, 2026-02-21T10:49:44.9139548Z 'pid_type': 'flat', 2026-02-21T10:49:44.9139738Z 'range_flattens': [None, False], 2026-02-21T10:49:44.9139922Z 'range_multi_buffers': [None, None], 2026-02-21T10:49:44.9140119Z 'range_num_stages': [0, 1], 2026-02-21T10:49:44.9140296Z 'range_unroll_factors': [0, 4], 2026-02-21T10:49:44.9140592Z 'range_warp_specializes': [], 2026-02-21T10:49:44.9140769Z 'waves_per_eu': 3} 2026-02-21T10:49:44.9572646Z [97s] Fitting surrogate: 256 points, 256 targets 2026-02-21T10:49:45.6595051Z [98s] Generation 3 starting: 70 neighbors, 5 active search path(s) 2026-02-21T10:50:09.9018475Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.9 configs/s 
2026-02-21T10:50:14.2968292Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.7 configs/s 2026-02-21T10:50:18.3720923Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 238.8 2026-02-21T10:50:18.3721543Z configs/s 2026-02-21T10:50:19.0258222Z [131s] Generation 3 complete: 2026-02-21T10:50:19.0258852Z ok=75 2026-02-21T10:50:19.0259085Z min=0.0464 2026-02-21T10:50:19.0259299Z mid=0.0616 2026-02-21T10:50:19.0259498Z max=0.5650 2026-02-21T10:50:19.0259728Z best={'block_sizes': [1, 64, 16], 2026-02-21T10:50:19.0260137Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:50:19.0260544Z 'l2_groupings': [64], 2026-02-21T10:50:19.0260820Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:50:19.0261137Z 'loop_orders': [[0, 1]], 2026-02-21T10:50:19.0261413Z 'matrix_instr_nonkdim': 0, 2026-02-21T10:50:19.0261695Z 'num_stages': 2, 2026-02-21T10:50:19.0261933Z 'num_warps': 4, 2026-02-21T10:50:19.0262312Z 'pid_type': 'flat', 2026-02-21T10:50:19.0262580Z 'range_flattens': [None, True], 2026-02-21T10:50:19.0262884Z 'range_multi_buffers': [None, None], 2026-02-21T10:50:19.0263209Z 'range_num_stages': [0, 2], 2026-02-21T10:50:19.0263487Z 'range_unroll_factors': [0, 2], 2026-02-21T10:50:19.0263789Z 'range_warp_specializes': [], 2026-02-21T10:50:19.0264077Z 'waves_per_eu': 3} 2026-02-21T10:50:19.0308180Z [131s] Fitting surrogate: 331 points, 331 targets 2026-02-21T10:50:19.7209877Z [132s] Generation 4 starting: 67 neighbors, 5 active search path(s) 2026-02-21T10:50:30.0333337Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 11.5 configs/s 2026-02-21T10:50:34.3240095Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 16.4 configs/s 2026-02-21T10:50:36.9623667Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 364.3 2026-02-21T10:50:36.9624292Z configs/s 2026-02-21T10:50:37.5116138Z [149s] Generation 4 complete: 2026-02-21T10:50:37.5116622Z ok=72 2026-02-21T10:50:37.5116871Z 
min=0.0457 2026-02-21T10:50:37.5117124Z mid=0.0794 2026-02-21T10:50:37.5117361Z max=1.5306 2026-02-21T10:50:37.5117605Z best={'block_sizes': [1, 64, 16], 2026-02-21T10:50:37.5118040Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:50:37.5118462Z 'l2_groupings': [64], 2026-02-21T10:50:37.5118827Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:50:37.5119141Z 'loop_orders': [[0, 1]], 2026-02-21T10:50:37.5119423Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:50:37.5119691Z 'num_stages': 2, 2026-02-21T10:50:37.5119939Z 'num_warps': 4, 2026-02-21T10:50:37.5120168Z 'pid_type': 'flat', 2026-02-21T10:50:37.5120441Z 'range_flattens': [None, True], 2026-02-21T10:50:37.5120762Z 'range_multi_buffers': [None, None], 2026-02-21T10:50:37.5121082Z 'range_num_stages': [0, 3], 2026-02-21T10:50:37.5121361Z 'range_unroll_factors': [0, 2], 2026-02-21T10:50:37.5121653Z 'range_warp_specializes': [], 2026-02-21T10:50:37.5121964Z 'waves_per_eu': 3} 2026-02-21T10:50:37.5668208Z [150s] Fitting surrogate: 403 points, 403 targets 2026-02-21T10:50:38.1967704Z [150s] Generation 5 starting: 47 neighbors, 4 active search path(s) 2026-02-21T10:50:46.5331541Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 6.1 configs/s 2026-02-21T10:50:49.5730429Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 16.6 configs/s 2026-02-21T10:50:51.9346843Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 403.8 2026-02-21T10:50:51.9347456Z configs/s 2026-02-21T10:50:52.3957783Z [164s] Generation 5 complete: 2026-02-21T10:50:52.3958148Z ok=51 2026-02-21T10:50:52.3958359Z min=0.0432 2026-02-21T10:50:52.3958583Z mid=0.0655 2026-02-21T10:50:52.3958782Z max=0.5865 2026-02-21T10:50:52.3959020Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:50:52.3959427Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:50:52.3959853Z 'l2_groupings': [64], 2026-02-21T10:50:52.3960133Z 'load_eviction_policies': ['', '', ''], 
2026-02-21T10:50:52.3960477Z 'loop_orders': [[0, 1]], 2026-02-21T10:50:52.3960758Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:50:52.3961030Z 'num_stages': 2, 2026-02-21T10:50:52.3961270Z 'num_warps': 4, 2026-02-21T10:50:52.3961741Z 'pid_type': 'flat', 2026-02-21T10:50:52.3962020Z 'range_flattens': [None, None], 2026-02-21T10:50:52.3962319Z 'range_multi_buffers': [None, None], 2026-02-21T10:50:52.3962708Z 'range_num_stages': [0, 3], 2026-02-21T10:50:52.3962981Z 'range_unroll_factors': [0, 2], 2026-02-21T10:50:52.3963289Z 'range_warp_specializes': [], 2026-02-21T10:50:52.3963566Z 'waves_per_eu': 3} 2026-02-21T10:50:52.4393336Z [164s] Fitting surrogate: 454 points, 454 targets 2026-02-21T10:50:52.9443205Z [165s] Generation 6 starting: 46 neighbors, 3 active search path(s) 2026-02-21T10:50:59.6263284Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 7.7 configs/s 2026-02-21T10:51:02.4794636Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 16.4 configs/s 2026-02-21T10:51:05.9416561Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 315.4 2026-02-21T10:51:05.9417190Z configs/s 2026-02-21T10:51:06.4360354Z [178s] Generation 6 complete: 2026-02-21T10:51:06.4360774Z error=1 2026-02-21T10:51:06.4360981Z ok=48 2026-02-21T10:51:06.4361188Z min=0.0440 2026-02-21T10:51:06.4361429Z mid=0.0566 2026-02-21T10:51:06.4361633Z max=0.3078 2026-02-21T10:51:06.4361865Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:51:06.4362300Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:51:06.4362781Z 'l2_groupings': [64], 2026-02-21T10:51:06.4363061Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:51:06.4363385Z 'loop_orders': [[0, 1]], 2026-02-21T10:51:06.4363666Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:51:06.4363988Z 'num_stages': 2, 2026-02-21T10:51:06.4364212Z 'num_warps': 4, 2026-02-21T10:51:06.4364437Z 'pid_type': 'flat', 2026-02-21T10:51:06.4364682Z 'range_flattens': [None, False], 
2026-02-21T10:51:06.4364990Z 'range_multi_buffers': [None, None], 2026-02-21T10:51:06.4365274Z 'range_num_stages': [0, 3], 2026-02-21T10:51:06.4365544Z 'range_unroll_factors': [0, 2], 2026-02-21T10:51:06.4365830Z 'range_warp_specializes': [], 2026-02-21T10:51:06.4366089Z 'waves_per_eu': 3} 2026-02-21T10:51:06.4934727Z [178s] Fitting surrogate: 503 points, 503 targets 2026-02-21T10:51:06.9958383Z [179s] Generation 7 starting: 46 neighbors, 3 active search path(s) 2026-02-21T10:51:12.1266454Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 5.0 configs/s 2026-02-21T10:51:15.0842403Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 16.3 configs/s 2026-02-21T10:51:17.9734047Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 332.4 2026-02-21T10:51:17.9734715Z configs/s 2026-02-21T10:51:18.4661306Z [190s] Generation 7 complete: 2026-02-21T10:51:18.4662059Z ok=49 2026-02-21T10:51:18.4662178Z min=0.0438 2026-02-21T10:51:18.4662296Z mid=0.0504 2026-02-21T10:51:18.4662416Z max=0.5898 2026-02-21T10:51:18.4662553Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:51:18.4662805Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:51:18.4663033Z 'l2_groupings': [64], 2026-02-21T10:51:18.4663178Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:51:18.4663356Z 'loop_orders': [[0, 1]], 2026-02-21T10:51:18.4663527Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:51:18.4663671Z 'num_stages': 2, 2026-02-21T10:51:18.4663791Z 'num_warps': 4, 2026-02-21T10:51:18.4663922Z 'pid_type': 'flat', 2026-02-21T10:51:18.4664060Z 'range_flattens': [None, False], 2026-02-21T10:51:18.4664234Z 'range_multi_buffers': [None, None], 2026-02-21T10:51:18.4664402Z 'range_num_stages': [0, 2], 2026-02-21T10:51:18.4664556Z 'range_unroll_factors': [0, 2], 2026-02-21T10:51:18.4664727Z 'range_warp_specializes': [], 2026-02-21T10:51:18.4664880Z 'waves_per_eu': 3} 2026-02-21T10:51:18.5183047Z [190s] Fitting surrogate: 552 points, 552 targets 
2026-02-21T10:51:18.8846008Z [191s] Generation 8 starting: 30 neighbors, 2 active search path(s) 2026-02-21T10:51:24.0184231Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 3.6 configs/s 2026-02-21T10:51:25.9578666Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 16.7 configs/s 2026-02-21T10:51:27.9151553Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 482.1 2026-02-21T10:51:27.9152176Z configs/s 2026-02-21T10:51:28.4024813Z [200s] Generation 8 complete: 2026-02-21T10:51:28.4025264Z ok=32 2026-02-21T10:51:28.4025483Z min=0.0445 2026-02-21T10:51:28.4025693Z mid=0.0485 2026-02-21T10:51:28.4025888Z max=0.2253 2026-02-21T10:51:28.4026118Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:51:28.4027027Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:51:28.4027433Z 'l2_groupings': [64], 2026-02-21T10:51:28.4027725Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:51:28.4028037Z 'loop_orders': [[0, 1]], 2026-02-21T10:51:28.4028314Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:51:28.4028586Z 'num_stages': 2, 2026-02-21T10:51:28.4028814Z 'num_warps': 4, 2026-02-21T10:51:28.4029040Z 'pid_type': 'flat', 2026-02-21T10:51:28.4029297Z 'range_flattens': [None, False], 2026-02-21T10:51:28.4029599Z 'range_multi_buffers': [None, None], 2026-02-21T10:51:28.4029912Z 'range_num_stages': [0, 2], 2026-02-21T10:51:28.4030184Z 'range_unroll_factors': [0, 2], 2026-02-21T10:51:28.4030477Z 'range_warp_specializes': [], 2026-02-21T10:51:28.4030752Z 'waves_per_eu': 3} 2026-02-21T10:51:28.4397976Z [200s] Fitting surrogate: 584 points, 584 targets 2026-02-21T10:51:28.8049017Z [201s] Generation 9 starting: 26 neighbors, 2 active search path(s) 2026-02-21T10:51:32.7662638Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 8.6 configs/s 2026-02-21T10:51:34.4326976Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.1 configs/s 2026-02-21T10:51:35.7920876Z Generation 9: verifying top 
configs 100% ━━━━━━━━━━━━━━ 1000/1000 678.1 2026-02-21T10:51:35.7921303Z configs/s 2026-02-21T10:51:36.1775769Z [208s] Generation 9 complete: 2026-02-21T10:51:36.1775965Z ok=28 2026-02-21T10:51:36.1776089Z min=0.0438 2026-02-21T10:51:36.1776226Z mid=0.0494 2026-02-21T10:51:36.1776739Z max=0.1405 2026-02-21T10:51:36.1776931Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:51:36.1777203Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:51:36.1777430Z 'l2_groupings': [64], 2026-02-21T10:51:36.1777620Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:51:36.1777801Z 'loop_orders': [[0, 1]], 2026-02-21T10:51:36.1777957Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:51:36.1778120Z 'num_stages': 2, 2026-02-21T10:51:36.1778252Z 'num_warps': 4, 2026-02-21T10:51:36.1778388Z 'pid_type': 'flat', 2026-02-21T10:51:36.1778669Z 'range_flattens': [None, False], 2026-02-21T10:51:36.1778846Z 'range_multi_buffers': [None, None], 2026-02-21T10:51:36.1779023Z 'range_num_stages': [0, 2], 2026-02-21T10:51:36.1779274Z 'range_unroll_factors': [0, 2], 2026-02-21T10:51:36.1779446Z 'range_warp_specializes': [], 2026-02-21T10:51:36.1779607Z 'waves_per_eu': 3} 2026-02-21T10:51:36.2034404Z [208s] Fitting surrogate: 612 points, 612 targets 2026-02-21T10:51:36.4555068Z [208s] Generation 10 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:51:38.8496231Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 11.9 configs/s 2026-02-21T10:51:39.8522716Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.5 configs/s 2026-02-21T10:51:40.2518201Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2073.1 2026-02-21T10:51:40.2518784Z configs/s 2026-02-21T10:51:40.5965754Z [213s] Generation 10 complete: 2026-02-21T10:51:40.5966143Z ok=17 2026-02-21T10:51:40.5966357Z min=0.0463 2026-02-21T10:51:40.5966567Z mid=0.0741 2026-02-21T10:51:40.5966767Z max=0.3287 2026-02-21T10:51:40.5967006Z best={'block_sizes': [1, 64, 32], 
2026-02-21T10:51:40.5967409Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:51:40.5967807Z 'l2_groupings': [64], 2026-02-21T10:51:40.5968085Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:51:40.5968396Z 'loop_orders': [[0, 1]], 2026-02-21T10:51:40.5968682Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:51:40.5968950Z 'num_stages': 2, 2026-02-21T10:51:40.5969179Z 'num_warps': 4, 2026-02-21T10:51:40.5969393Z 'pid_type': 'flat', 2026-02-21T10:51:40.5969588Z 'range_flattens': [None, False], 2026-02-21T10:51:40.5969837Z 'range_multi_buffers': [None, None], 2026-02-21T10:51:40.5970084Z 'range_num_stages': [0, 2], 2026-02-21T10:51:40.5970629Z 'range_unroll_factors': [0, 2], 2026-02-21T10:51:40.5970876Z 'range_warp_specializes': [], 2026-02-21T10:51:40.5971105Z 'waves_per_eu': 3} 2026-02-21T10:51:40.6064897Z [213s] Fitting surrogate: 629 points, 629 targets 2026-02-21T10:51:40.8322830Z [213s] Generation 11 starting: 12 neighbors, 1 active search path(s) 2026-02-21T10:51:49.7859169Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 0.4 configs/s 2026-02-21T10:51:50.6653611Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 17.7 configs/s 2026-02-21T10:51:50.9235575Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3621.8 2026-02-21T10:51:50.9236246Z configs/s 2026-02-21T10:51:51.2695882Z [223s] Generation 11 complete: 2026-02-21T10:51:51.2696085Z ok=14 2026-02-21T10:51:51.2696176Z min=0.0438 2026-02-21T10:51:51.2696276Z mid=0.0750 2026-02-21T10:51:51.2696359Z max=1.5105 2026-02-21T10:51:51.2696471Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:51:51.2696636Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:51:51.2697040Z 'l2_groupings': [64], 2026-02-21T10:51:51.2697154Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:51:51.2697289Z 'loop_orders': [[0, 1]], 2026-02-21T10:51:51.2697412Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:51:51.2697520Z 'num_stages': 2, 
2026-02-21T10:51:51.2697615Z 'num_warps': 4, 2026-02-21T10:51:51.2697722Z 'pid_type': 'flat', 2026-02-21T10:51:51.2697831Z 'range_flattens': [None, False], 2026-02-21T10:51:51.2697957Z 'range_multi_buffers': [None, None], 2026-02-21T10:51:51.2698161Z 'range_num_stages': [0, 2], 2026-02-21T10:51:51.2698273Z 'range_unroll_factors': [0, 2], 2026-02-21T10:51:51.2698395Z 'range_warp_specializes': [], 2026-02-21T10:51:51.2698514Z 'waves_per_eu': 3} 2026-02-21T10:51:51.2779001Z [223s] Fitting surrogate: 643 points, 643 targets 2026-02-21T10:51:51.5379279Z [223s] Generation 12 starting: 14 neighbors, 1 active search path(s) 2026-02-21T10:51:54.1176352Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 4.6 configs/s 2026-02-21T10:51:55.0721862Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.3 configs/s 2026-02-21T10:51:55.6385944Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1522.8 2026-02-21T10:51:55.6386826Z configs/s 2026-02-21T10:51:56.0219947Z [228s] Generation 12 complete: 2026-02-21T10:51:56.0220367Z ok=16 2026-02-21T10:51:56.0220576Z min=0.0460 2026-02-21T10:51:56.0220819Z mid=0.0604 2026-02-21T10:51:56.0221023Z max=0.3355 2026-02-21T10:51:56.0221247Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:51:56.0221649Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:51:56.0222038Z 'l2_groupings': [64], 2026-02-21T10:51:56.0222316Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:51:56.0222631Z 'loop_orders': [[0, 1]], 2026-02-21T10:51:56.0223375Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:51:56.0223644Z 'num_stages': 2, 2026-02-21T10:51:56.0223871Z 'num_warps': 4, 2026-02-21T10:51:56.0224119Z 'pid_type': 'flat', 2026-02-21T10:51:56.0224379Z 'range_flattens': [None, False], 2026-02-21T10:51:56.0224683Z 'range_multi_buffers': [None, None], 2026-02-21T10:51:56.0225003Z 'range_num_stages': [0, 2], 2026-02-21T10:51:56.0225277Z 'range_unroll_factors': [0, 2], 2026-02-21T10:51:56.0225574Z 
'range_warp_specializes': [], 2026-02-21T10:51:56.0225849Z 'waves_per_eu': 3} 2026-02-21T10:51:56.0326785Z [228s] Fitting surrogate: 659 points, 659 targets 2026-02-21T10:51:56.2767777Z [228s] Generation 13 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:52:11.4792348Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.5 configs/s 2026-02-21T10:52:12.5563541Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.2 configs/s 2026-02-21T10:52:12.7349677Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5882.9 2026-02-21T10:52:12.7350221Z configs/s 2026-02-21T10:52:13.0676954Z [245s] Generation 13 complete: 2026-02-21T10:52:13.0677207Z ok=17 2026-02-21T10:52:13.0677389Z min=0.0430 2026-02-21T10:52:13.0677549Z mid=0.0954 2026-02-21T10:52:13.0677707Z max=1.6010 2026-02-21T10:52:13.0677903Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:13.0678207Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:13.0678508Z 'l2_groupings': [64], 2026-02-21T10:52:13.0678718Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:13.0679130Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:13.0679340Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:13.0679545Z 'num_stages': 2, 2026-02-21T10:52:13.0679715Z 'num_warps': 4, 2026-02-21T10:52:13.0679892Z 'pid_type': 'flat', 2026-02-21T10:52:13.0680087Z 'range_flattens': [None, False], 2026-02-21T10:52:13.0680318Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:13.0680554Z 'range_num_stages': [0, 2], 2026-02-21T10:52:13.0680762Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:13.0681101Z 'range_warp_specializes': [], 2026-02-21T10:52:13.0681310Z 'waves_per_eu': 3} 2026-02-21T10:52:13.0732897Z [245s] Fitting surrogate: 676 points, 676 targets 2026-02-21T10:52:13.3204512Z [245s] Generation 14 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:52:15.6815931Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 12.5 configs/s 
2026-02-21T10:52:16.6953462Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.3 configs/s 2026-02-21T10:52:17.5781567Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1002.7 2026-02-21T10:52:17.5783202Z configs/s 2026-02-21T10:52:17.9821695Z [250s] Generation 14 complete: 2026-02-21T10:52:17.9822044Z ok=17 2026-02-21T10:52:17.9822237Z min=0.0443 2026-02-21T10:52:17.9822400Z mid=0.0488 2026-02-21T10:52:17.9822583Z max=0.3780 2026-02-21T10:52:17.9822757Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:17.9823060Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:17.9823379Z 'l2_groupings': [64], 2026-02-21T10:52:17.9823589Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:17.9823825Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:17.9824468Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:17.9824665Z 'num_stages': 2, 2026-02-21T10:52:17.9835269Z 'num_warps': 4, 2026-02-21T10:52:17.9835417Z 'pid_type': 'flat', 2026-02-21T10:52:17.9835557Z 'range_flattens': [None, False], 2026-02-21T10:52:17.9835733Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:17.9835935Z 'range_num_stages': [0, 2], 2026-02-21T10:52:17.9836086Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:17.9836245Z 'range_warp_specializes': [], 2026-02-21T10:52:17.9836395Z 'waves_per_eu': 3} 2026-02-21T10:52:17.9971933Z [250s] Fitting surrogate: 693 points, 693 targets 2026-02-21T10:52:18.2232127Z [250s] Generation 15 starting: 11 neighbors, 1 active search path(s) 2026-02-21T10:52:19.5324311Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 17.6 configs/s 2026-02-21T10:52:20.2249028Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 16.9 configs/s 2026-02-21T10:52:20.6159682Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2146.6 2026-02-21T10:52:20.6160242Z configs/s 2026-02-21T10:52:20.9605837Z [253s] Generation 15 complete: 2026-02-21T10:52:20.9606034Z ok=13 
2026-02-21T10:52:20.9606168Z min=0.0448 2026-02-21T10:52:20.9606284Z mid=0.0626 2026-02-21T10:52:20.9606387Z max=0.2893 2026-02-21T10:52:20.9606494Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:20.9606694Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:20.9606923Z 'l2_groupings': [64], 2026-02-21T10:52:20.9607065Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:20.9607234Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:20.9607374Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:20.9607522Z 'num_stages': 2, 2026-02-21T10:52:20.9607650Z 'num_warps': 4, 2026-02-21T10:52:20.9607774Z 'pid_type': 'flat', 2026-02-21T10:52:20.9607911Z 'range_flattens': [None, False], 2026-02-21T10:52:20.9608074Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:20.9608206Z 'range_num_stages': [0, 2], 2026-02-21T10:52:20.9608331Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:20.9608496Z 'range_warp_specializes': [], 2026-02-21T10:52:20.9608653Z 'waves_per_eu': 3} 2026-02-21T10:52:20.9702825Z [253s] Fitting surrogate: 706 points, 706 targets 2026-02-21T10:52:21.2268650Z [253s] Generation 16 starting: 16 neighbors, 1 active search path(s) 2026-02-21T10:52:34.0645851Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.7 configs/s 2026-02-21T10:52:35.2125121Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.0 configs/s 2026-02-21T10:52:35.5734579Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2381.1 2026-02-21T10:52:35.5735152Z configs/s 2026-02-21T10:52:35.9575131Z [268s] Generation 16 complete: 2026-02-21T10:52:35.9575515Z ok=18 2026-02-21T10:52:35.9575728Z min=0.0458 2026-02-21T10:52:35.9575960Z mid=0.0769 2026-02-21T10:52:35.9576162Z max=0.8985 2026-02-21T10:52:35.9576387Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:35.9576797Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:35.9577191Z 'l2_groupings': [64], 2026-02-21T10:52:35.9577490Z 
'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:35.9577801Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:35.9578077Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:35.9578343Z 'num_stages': 2, 2026-02-21T10:52:35.9578570Z 'num_warps': 4, 2026-02-21T10:52:35.9578805Z 'pid_type': 'flat', 2026-02-21T10:52:35.9579064Z 'range_flattens': [None, False], 2026-02-21T10:52:35.9579377Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:35.9579678Z 'range_num_stages': [0, 2], 2026-02-21T10:52:35.9579958Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:35.9580261Z 'range_warp_specializes': [], 2026-02-21T10:52:35.9580545Z 'waves_per_eu': 3} 2026-02-21T10:52:35.9669292Z [268s] Fitting surrogate: 724 points, 724 targets 2026-02-21T10:52:36.2212562Z [268s] Generation 17 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:52:40.5651375Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.3 configs/s 2026-02-21T10:52:41.6353507Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s 2026-02-21T10:52:41.8205600Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5614.8 2026-02-21T10:52:41.8206171Z configs/s 2026-02-21T10:52:42.1576344Z [274s] Generation 17 complete: 2026-02-21T10:52:42.1576505Z ok=17 2026-02-21T10:52:42.1576593Z min=0.0446 2026-02-21T10:52:42.1576911Z mid=0.0981 2026-02-21T10:52:42.1576990Z max=0.7139 2026-02-21T10:52:42.1577076Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:42.1577242Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:42.1577385Z 'l2_groupings': [64], 2026-02-21T10:52:42.1577490Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:42.1577615Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:42.1577719Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:42.1577823Z 'num_stages': 2, 2026-02-21T10:52:42.1577907Z 'num_warps': 4, 2026-02-21T10:52:42.1577993Z 'pid_type': 'flat', 2026-02-21T10:52:42.1578096Z 'range_flattens': [None, False], 
2026-02-21T10:52:42.1578211Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:42.1578321Z 'range_num_stages': [0, 2], 2026-02-21T10:52:42.1578424Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:42.1578534Z 'range_warp_specializes': [], 2026-02-21T10:52:42.1578639Z 'waves_per_eu': 3} 2026-02-21T10:52:42.1654762Z [274s] Fitting surrogate: 741 points, 741 targets 2026-02-21T10:52:42.3914239Z [274s] Generation 18 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:52:45.6797432Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.3 configs/s 2026-02-21T10:52:46.7518548Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s 2026-02-21T10:52:47.1386732Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2177.7 2026-02-21T10:52:47.1386969Z configs/s 2026-02-21T10:52:47.5039484Z [279s] Generation 18 complete: 2026-02-21T10:52:47.5040568Z ok=17 2026-02-21T10:52:47.5040789Z min=0.0464 2026-02-21T10:52:47.5041004Z mid=0.0657 2026-02-21T10:52:47.5041200Z max=0.5624 2026-02-21T10:52:47.5041429Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:47.5041787Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:47.5041978Z 'l2_groupings': [64], 2026-02-21T10:52:47.5042112Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:47.5042262Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:47.5042403Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:47.5042698Z 'num_stages': 2, 2026-02-21T10:52:47.5042809Z 'num_warps': 4, 2026-02-21T10:52:47.5042915Z 'pid_type': 'flat', 2026-02-21T10:52:47.5043036Z 'range_flattens': [None, False], 2026-02-21T10:52:47.5043266Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:47.5043408Z 'range_num_stages': [0, 2], 2026-02-21T10:52:47.5043534Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:47.5043676Z 'range_warp_specializes': [], 2026-02-21T10:52:47.5043804Z 'waves_per_eu': 3} 2026-02-21T10:52:47.5132540Z [279s] Fitting surrogate: 758 points, 758 targets 
2026-02-21T10:52:47.7486770Z [280s] Generation 19 starting: 13 neighbors, 1 active search path(s) 2026-02-21T10:52:51.1074113Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 3.6 configs/s 2026-02-21T10:52:52.0629321Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.3 configs/s 2026-02-21T10:52:52.3276045Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3447.3 2026-02-21T10:52:52.3276651Z configs/s 2026-02-21T10:52:52.6914692Z [285s] Generation 19 complete: 2026-02-21T10:52:52.6914946Z ok=15 2026-02-21T10:52:52.6915219Z min=0.0437 2026-02-21T10:52:52.6915401Z mid=0.0743 2026-02-21T10:52:52.6915551Z max=0.4740 2026-02-21T10:52:52.6915722Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:52.6916027Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:52.6916337Z 'l2_groupings': [64], 2026-02-21T10:52:52.6916546Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:52.6916780Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:52.6916988Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:52.6917185Z 'num_stages': 2, 2026-02-21T10:52:52.6917359Z 'num_warps': 4, 2026-02-21T10:52:52.6917531Z 'pid_type': 'flat', 2026-02-21T10:52:52.6917992Z 'range_flattens': [None, False], 2026-02-21T10:52:52.6918225Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:52.6918453Z 'range_num_stages': [0, 2], 2026-02-21T10:52:52.6918676Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:52.6918898Z 'range_warp_specializes': [], 2026-02-21T10:52:52.6919112Z 'waves_per_eu': 3} 2026-02-21T10:52:52.6998866Z [285s] Fitting surrogate: 773 points, 773 targets 2026-02-21T10:52:52.9402011Z [285s] Generation 20 starting: 15 neighbors, 1 active search path(s) 2026-02-21T10:52:55.6447196Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 8.0 configs/s 2026-02-21T10:52:56.6679945Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.1 configs/s 2026-02-21T10:52:57.4320939Z Generation 20: 
verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1148.3 2026-02-21T10:52:57.4321529Z configs/s 2026-02-21T10:52:57.8050628Z [290s] Generation 20 complete: 2026-02-21T10:52:57.8051032Z ok=17 2026-02-21T10:52:57.8051280Z min=0.0434 2026-02-21T10:52:57.8051493Z mid=0.0549 2026-02-21T10:52:57.8051689Z max=0.1516 2026-02-21T10:52:57.8052423Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:52:57.8052821Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T10:52:57.8053210Z 'l2_groupings': [64], 2026-02-21T10:52:57.8053499Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:52:57.8053816Z 'loop_orders': [[0, 1]], 2026-02-21T10:52:57.8054089Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:52:57.8054361Z 'num_stages': 2, 2026-02-21T10:52:57.8054621Z 'num_warps': 4, 2026-02-21T10:52:57.8054977Z 'pid_type': 'flat', 2026-02-21T10:52:57.8055233Z 'range_flattens': [None, False], 2026-02-21T10:52:57.8055536Z 'range_multi_buffers': [None, None], 2026-02-21T10:52:57.8055837Z 'range_num_stages': [0, 2], 2026-02-21T10:52:57.8056115Z 'range_unroll_factors': [0, 2], 2026-02-21T10:52:57.8056409Z 'range_warp_specializes': [], 2026-02-21T10:52:57.8056682Z 'waves_per_eu': 3} 2026-02-21T10:52:57.8201435Z [290s] Fitting surrogate: 790 points, 790 targets 2026-02-21T10:52:58.3780971Z [290s] Autotuning complete in 290.8s after searching 746 configs. 
2026-02-21T10:52:58.3781489Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:52:58.3783426Z @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T10:52:58.3784812Z 2026-02-21T10:52:58.3785174Z [290s] Code of selected kernel: /tmp/torchinductor_root/4t/c4tnig7zzinrar4lpsnkttfd4dxmfj7alahjvfll6sl6hpe4kmtn.py 2026-02-21T10:52:58.9684747Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T10:52:58.9687585Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T10:52:58.9689721Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T10:52:58.9690100Z WARNING:tritonbench.utils.triton_op:Completed input ID 1: 2026-02-21T10:52:58.9690380Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T10:52:58.9690600Z ------------------------------------------ 2026-02-21T10:52:58.9690810Z (4, 48, 256, 256, 128) 2026-02-21T10:52:58.9690922Z 2026-02-21T10:52:58.9691522Z 33%|███▎ | 2/6 [08:32<17:32, 263.04s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T10:52:58.9691875Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T10:52:58.9692095Z ------------------------------------------ 2026-02-21T10:52:58.9692294Z (4, 48, 512, 512, 128) 2026-02-21T10:52:58.9692571Z INFO:tritonbench.utils.triton_op:Took 0.06ms to get benchmark function for aten 2026-02-21T10:53:00.0227282Z INFO:tritonbench.utils.triton_op:Took 1.62ms to get benchmark function for flex_attention 2026-02-21T10:53:01.6032561Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:53:01.6033022Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:53:01.6033400Z 'dtype': 'torch.bfloat16', 2026-02-21T10:53:01.6033720Z 'shape': (4, 48, 512, 128), 2026-02-21T10:53:01.6034048Z 'stride': (3145728, 65536, 128, 1)}, 2026-02-21T10:53:01.6034375Z { 'device': 'cuda:0', 2026-02-21T10:53:01.6034673Z 'dtype': 'torch.bfloat16', 2026-02-21T10:53:01.6034981Z 'shape': (4, 48, 512, 128), 2026-02-21T10:53:01.6035299Z 'stride': (3145728, 65536, 128, 1)}, 2026-02-21T10:53:01.6035610Z { 'device': 'cuda:0', 2026-02-21T10:53:01.6036372Z 'dtype': 'torch.bfloat16', 2026-02-21T10:53:01.6036676Z 'shape': (4, 48, 512, 128), 2026-02-21T10:53:01.6036996Z 'stride': (3145728, 65536, 128, 1)}), 2026-02-21T10:53:01.6037312Z 'kwargs': {}} 2026-02-21T10:53:01.6055037Z INFO:tritonbench.utils.triton_op:Took 2.79ms to get benchmark function for helion_attention 2026-02-21T10:53:01.8484858Z [0s] Autotune random seed: 2144140282 2026-02-21T10:53:02.0022476Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, 
similarity_penalty=1.0 2026-02-21T10:53:42.4272510Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 8, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 2], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T10:53:42.4289103Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T10:53:44.1017487Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:57:24: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T10:53:44.1018034Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 512], [65536, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T10:53:44.1018348Z ^ 2026-02-21T10:53:44.1019081Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:59:145: note: - use: %143 = "tt.reshape"(<>) : (tensor<1x128x128xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x128xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>> 2026-02-21T10:53:44.1019695Z 2026-02-21T10:53:44.1020032Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T10:53:44.1020488Z ^ 2026-02-21T10:53:44.1020678Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T10:53:44.1020957Z #blocked = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}> 2026-02-21T10:53:44.1021275Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}> 2026-02-21T10:53:44.1021636Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}> 2026-02-21T10:53:44.1021968Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}> 2026-02-21T10:53:44.1022283Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:53:44.1022589Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T10:53:44.1022883Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T10:53:44.1023171Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}> 2026-02-21T10:53:44.1023456Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}> 2026-02-21T10:53:44.1023848Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}> 2026-02-21T10:53:44.1024162Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}> 2026-02-21T10:53:44.1024474Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:53:44.1024872Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T10:53:44.1025395Z tt.func public 
@_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T10:53:44.1025804Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T10:53:44.1025934Z %c192_i64 = arith.constant 192 : i64 2026-02-21T10:53:44.1026053Z %c0_i64 = arith.constant 0 : i64 2026-02-21T10:53:44.1026170Z %c65536_i64 = arith.constant 65536 : i64 2026-02-21T10:53:44.1026347Z %cst = arith.constant dense<0.000000e+00> : tensor<1x128x128xbf16, #blocked> 2026-02-21T10:53:44.1026555Z %cst_0 = arith.constant dense<512> : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1026738Z %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:53:44.1026916Z %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:53:44.1027109Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked3> 2026-02-21T10:53:44.1027300Z %cst_4 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1027496Z %cst_5 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1027667Z %cst_6 = arith.constant dense<512> : tensor<1x2x1xi64, #blocked4> 2026-02-21T10:53:44.1027845Z %cst_7 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked4> 2026-02-21T10:53:44.1028017Z %cst_8 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked4> 2026-02-21T10:53:44.1028162Z %c128_i32 = arith.constant 128 : i32 2026-02-21T10:53:44.1028278Z %c512_i32 = arith.constant 512 : i32 2026-02-21T10:53:44.1028390Z %c304_i32 = arith.constant 304 : i32 2026-02-21T10:53:44.1028529Z %cst_9 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked4> 2026-02-21T10:53:44.1028704Z %cst_10 = arith.constant dense<128> : tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.1028896Z %cst_11 = arith.constant dense<0.127517432> : tensor<1x2x128xf32, #blocked3> 
2026-02-21T10:53:44.1029093Z %cst_12 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1029288Z %cst_13 = arith.constant dense<0.000000e+00> : tensor<2x128xf32, #blocked6> 2026-02-21T10:53:44.1029447Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:53:44.1029601Z %cst_14 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1029798Z %cst_15 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1029989Z %cst_16 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1030144Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:53:44.1030258Z %c192_i32 = arith.constant 192 : i32 2026-02-21T10:53:44.1030372Z %c49152_i32 = arith.constant 49152 : i32 2026-02-21T10:53:44.1030493Z %0 = tt.get_program_id x : i32 2026-02-21T10:53:44.1030642Z %1 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked7> 2026-02-21T10:53:44.1030843Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7> 2026-02-21T10:53:44.1031049Z %3 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked3> 2026-02-21T10:53:44.1031253Z %4 = arith.extsi %1 : tensor<2xi32, #blocked7> to tensor<2xi64, #blocked7> 2026-02-21T10:53:44.1031475Z %5 = arith.extsi %2 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7> 2026-02-21T10:53:44.1031734Z %6 = ttg.convert_layout %5 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:53:44.1032059Z %7 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8> 2026-02-21T10:53:44.1032341Z %8 = ttg.convert_layout %7 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked6> 2026-02-21T10:53:44.1032634Z %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T10:53:44.1032966Z %10 = tt.expand_dims %9 
{axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9> 2026-02-21T10:53:44.1033261Z %11 = ttg.convert_layout %10 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1033506Z %12 = tt.broadcast %11 : tensor<1x1x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1> 2026-02-21T10:53:44.1033763Z %13 = ttg.convert_layout %12 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked3> 2026-02-21T10:53:44.1033977Z %14 = arith.cmpi sge, %11, %cst_5 : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1034149Z %15 = arith.cmpi slt, %11, %cst_4 : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1034311Z %16 = arith.andi %14, %15 : tensor<1x1x128xi1, #blocked1> 2026-02-21T10:53:44.1034502Z %17 = tt.broadcast %16 : tensor<1x1x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1> 2026-02-21T10:53:44.1034737Z %18 = ttg.convert_layout %17 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked3> 2026-02-21T10:53:44.1034965Z %19 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.1035254Z %20 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T10:53:44.1035598Z %21 = tt.expand_dims %20 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi64, #blocked10> 2026-02-21T10:53:44.1035901Z %22 = ttg.convert_layout %21 : tensor<1x128x1xi64, #blocked10> -> tensor<1x128x1xi64, #blocked2> 2026-02-21T10:53:44.1036166Z %23 = tt.broadcast %22 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x128xi64, #blocked2> 2026-02-21T10:53:44.1036417Z %24 = ttg.convert_layout %23 : tensor<1x128x128xi64, #blocked2> -> tensor<1x128x128xi64, #blocked> 2026-02-21T10:53:44.1036636Z %25 = arith.cmpi sge, %22, %cst_2 : tensor<1x128x1xi64, #blocked2> 2026-02-21T10:53:44.1036805Z %26 = arith.cmpi slt, %22, %cst_1 : tensor<1x128x1xi64, #blocked2> 
2026-02-21T10:53:44.1036964Z %27 = arith.andi %25, %26 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:53:44.1037200Z %28 = ttg.convert_layout %2 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:53:44.1037524Z %29 = tt.expand_dims %28 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8> 2026-02-21T10:53:44.1037812Z %30 = ttg.convert_layout %29 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked6> 2026-02-21T10:53:44.1038092Z %31 = ttg.convert_layout %30 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T10:53:44.1038426Z %32 = tt.expand_dims %31 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9> 2026-02-21T10:53:44.1038722Z %33 = ttg.convert_layout %32 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked1> 2026-02-21T10:53:44.1038968Z %34 = tt.broadcast %33 : tensor<1x1x128xi32, #blocked1> -> tensor<1x128x128xi32, #blocked1> 2026-02-21T10:53:44.1039219Z %35 = ttg.convert_layout %34 : tensor<1x128x128xi32, #blocked1> -> tensor<1x128x128xi32, #blocked> 2026-02-21T10:53:44.1039473Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<1x128x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.1039734Z %37 = ttg.convert_layout %2 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:53:44.1040054Z %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8> 2026-02-21T10:53:44.1040334Z %39 = ttg.convert_layout %38 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked6> 2026-02-21T10:53:44.1040635Z %40 = ttg.convert_layout %39 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T10:53:44.1040967Z %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<1x128xi32, 
#ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9> 2026-02-21T10:53:44.1041260Z %42 = ttg.convert_layout %41 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked1> 2026-02-21T10:53:44.1041502Z %43 = tt.broadcast %42 : tensor<1x1x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1> 2026-02-21T10:53:44.1041737Z %44 = ttg.convert_layout %43 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked3> 2026-02-21T10:53:44.1041981Z %45 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked3> 2026-02-21T10:53:44.1042173Z scf.for %arg4 = %0 to %c49152_i32 step %c304_i32 : i32 { 2026-02-21T10:53:44.1042315Z %46 = arith.remsi %arg4, %c192_i32 : i32 2026-02-21T10:53:44.1042442Z %47 = arith.divsi %arg4, %c192_i32 : i32 2026-02-21T10:53:44.1042663Z %48 = arith.muli %47, %c2_i32 : i32 2026-02-21T10:53:44.1042802Z %49 = tt.splat %48 : i32 -> tensor<2xi32, #blocked7> 2026-02-21T10:53:44.1042949Z %50 = arith.addi %49, %1 : tensor<2xi32, #blocked7> 2026-02-21T10:53:44.1043083Z %51 = arith.extsi %46 : i32 to i64 2026-02-21T10:53:44.1043224Z %52 = arith.extsi %48 : i32 to i64 2026-02-21T10:53:44.1043340Z %53 = arith.muli %51, %c65536_i64 : i64 2026-02-21T10:53:44.1043483Z %54 = tt.splat %53 : i64 -> tensor<1x2x128xi64, #blocked3> 2026-02-21T10:53:44.1043634Z %55 = tt.splat %52 : i64 -> tensor<2xi64, #blocked7> 2026-02-21T10:53:44.1043780Z %56 = arith.addi %55, %4 : tensor<2xi64, #blocked7> 2026-02-21T10:53:44.1044008Z %57 = ttg.convert_layout %56 : tensor<2xi64, #blocked7> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:53:44.1044330Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x2xi64, #blocked8> 2026-02-21T10:53:44.1044610Z %59 = ttg.convert_layout %58 : tensor<1x2xi64, #blocked8> -> tensor<1x2xi64, #blocked5> 2026-02-21T10:53:44.1044885Z %60 = ttg.convert_layout %59 : tensor<1x2xi64, #blocked5> -> tensor<1x2xi64, #ttg.slice<{dim = 2, 
parent = #blocked11}>> 2026-02-21T10:53:44.1045215Z %61 = tt.expand_dims %60 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xi64, #blocked11> 2026-02-21T10:53:44.1045510Z %62 = ttg.convert_layout %61 : tensor<1x2x1xi64, #blocked11> -> tensor<1x2x1xi64, #blocked4> 2026-02-21T10:53:44.1045714Z %63 = arith.muli %62, %cst_8 : tensor<1x2x1xi64, #blocked4> 2026-02-21T10:53:44.1045909Z %64 = tt.broadcast %63 : tensor<1x2x1xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4> 2026-02-21T10:53:44.1046144Z %65 = ttg.convert_layout %64 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked3> 2026-02-21T10:53:44.1046354Z %66 = arith.addi %65, %13 : tensor<1x2x128xi64, #blocked3> 2026-02-21T10:53:44.1046508Z %67 = arith.addi %54, %66 : tensor<1x2x128xi64, #blocked3> 2026-02-21T10:53:44.1046711Z %68 = tt.addptr %3, %67 : tensor<1x2x128x!tt.ptr, #blocked3>, tensor<1x2x128xi64, #blocked3> 2026-02-21T10:53:44.1046905Z %69 = arith.cmpi sge, %51, %c0_i64 : i64 2026-02-21T10:53:44.1047027Z %70 = arith.cmpi slt, %51, %c192_i64 : i64 2026-02-21T10:53:44.1047150Z %71 = arith.andi %69, %70 : i1 2026-02-21T10:53:44.1047323Z %72 = arith.cmpi sge, %62, %cst_7 : tensor<1x2x1xi64, #blocked4> 2026-02-21T10:53:44.1047495Z %73 = arith.cmpi slt, %62, %cst_6 : tensor<1x2x1xi64, #blocked4> 2026-02-21T10:53:44.1047656Z %74 = arith.andi %72, %73 : tensor<1x2x1xi1, #blocked4> 2026-02-21T10:53:44.1047833Z %75 = tt.splat %71 : i1 -> tensor<1x2x1xi1, #blocked4> 2026-02-21T10:53:44.1047986Z %76 = arith.andi %75, %74 : tensor<1x2x1xi1, #blocked4> 2026-02-21T10:53:44.1048189Z %77 = tt.broadcast %76 : tensor<1x2x1xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4> 2026-02-21T10:53:44.1048428Z %78 = ttg.convert_layout %77 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked3> 2026-02-21T10:53:44.1048633Z %79 = arith.andi %78, %18 : tensor<1x2x128xi1, #blocked3> 2026-02-21T10:53:44.1048801Z %80 = tt.load %68, %79, %cst_3 : tensor<1x2x128x!tt.ptr, 
#blocked3> 2026-02-21T10:53:44.1048976Z %81 = tt.splat %53 : i64 -> tensor<1x128x128xi64, #blocked> 2026-02-21T10:53:44.1049134Z %82 = tt.splat %71 : i1 -> tensor<1x128x1xi1, #blocked2> 2026-02-21T10:53:44.1049287Z %83 = arith.andi %82, %27 : tensor<1x128x1xi1, #blocked2> 2026-02-21T10:53:44.1049497Z %84 = tt.broadcast %83 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x128xi1, #blocked2> 2026-02-21T10:53:44.1049742Z %85 = ttg.convert_layout %84 : tensor<1x128x128xi1, #blocked2> -> tensor<1x128x128xi1, #blocked> 2026-02-21T10:53:44.1049984Z %86 = tt.reshape %80 : tensor<1x2x128xbf16, #blocked3> -> tensor<2x128xbf16, #blocked6> 2026-02-21T10:53:44.1050162Z %87 = arith.muli %46, %c65536_i32 : i32 2026-02-21T10:53:44.1050299Z %88 = tt.splat %87 : i32 -> tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.1050703Z %89:3 = scf.for %arg5 = %c0_i32 to %c512_i32 step %c128_i32 iter_args(%arg6 = %cst_16, %arg7 = %cst_15, %arg8 = %cst_14) -> (tensor<1x2xf32, #blocked5>, tensor<1x2xf32, #blocked5>, tensor<1x2x128xf32, #blocked3>) : i32 { 2026-02-21T10:53:44.1051060Z %111 = tt.splat %arg5 : i32 -> tensor<128xi32, #blocked7> 2026-02-21T10:53:44.1051221Z %112 = arith.addi %111, %2 : tensor<128xi32, #blocked7> 2026-02-21T10:53:44.1051361Z %113 = arith.extsi %arg5 : i32 to i64 2026-02-21T10:53:44.1051500Z %114 = tt.splat %113 : i64 -> tensor<128xi64, #blocked7> 2026-02-21T10:53:44.1051653Z %115 = arith.addi %114, %5 : tensor<128xi64, #blocked7> 2026-02-21T10:53:44.1051892Z %116 = ttg.convert_layout %115 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:53:44.1052232Z %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8> 2026-02-21T10:53:44.1052529Z %118 = ttg.convert_layout %117 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked6> 2026-02-21T10:53:44.1052821Z %119 = ttg.convert_layout %118 : tensor<1x128xi64, #blocked6> -> 
tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T10:53:44.1053169Z %120 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9> 2026-02-21T10:53:44.1053477Z %121 = ttg.convert_layout %120 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1053695Z %122 = arith.muli %121, %cst_4 : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1053903Z %123 = tt.broadcast %122 : tensor<1x1x128xi64, #blocked1> -> tensor<1x128x128xi64, #blocked1> 2026-02-21T10:53:44.1054160Z %124 = ttg.convert_layout %123 : tensor<1x128x128xi64, #blocked1> -> tensor<1x128x128xi64, #blocked> 2026-02-21T10:53:44.1054381Z %125 = arith.addi %24, %124 : tensor<1x128x128xi64, #blocked> 2026-02-21T10:53:44.1054543Z %126 = arith.addi %81, %125 : tensor<1x128x128xi64, #blocked> 2026-02-21T10:53:44.1054758Z %127 = tt.addptr %19, %126 : tensor<1x128x128x!tt.ptr, #blocked>, tensor<1x128x128xi64, #blocked> 2026-02-21T10:53:44.1055002Z %128 = arith.cmpi sge, %121, %cst_5 : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1055182Z %129 = arith.cmpi slt, %121, %cst_0 : tensor<1x1x128xi64, #blocked1> 2026-02-21T10:53:44.1055357Z %130 = arith.andi %128, %129 : tensor<1x1x128xi1, #blocked1> 2026-02-21T10:53:44.1055557Z %131 = tt.broadcast %130 : tensor<1x1x128xi1, #blocked1> -> tensor<1x128x128xi1, #blocked1> 2026-02-21T10:53:44.1055826Z %132 = ttg.convert_layout %131 : tensor<1x128x128xi1, #blocked1> -> tensor<1x128x128xi1, #blocked> 2026-02-21T10:53:44.1056044Z %133 = arith.andi %85, %132 : tensor<1x128x128xi1, #blocked> 2026-02-21T10:53:44.1056221Z %134 = tt.load %127, %133, %cst : tensor<1x128x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.1056441Z %135 = tt.reshape %134 : tensor<1x128x128xbf16, #blocked> -> tensor<128x128xbf16, #blocked6> 2026-02-21T10:53:44.1056738Z %136 = ttg.convert_layout %86 : tensor<2x128xbf16, #blocked6> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, 
parent = #blocked6}>> 2026-02-21T10:53:44.1057113Z %137 = ttg.convert_layout %135 : tensor<128x128xbf16, #blocked6> -> tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T10:53:44.1057422Z %138 = ttg.convert_layout %cst_13 : tensor<2x128xf32, #blocked6> -> tensor<2x128xf32, #blocked6> 2026-02-21T10:53:44.1057846Z %139 = tt.dot %136, %137, %138, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<2x128xf32, #blocked6> 2026-02-21T10:53:44.1058251Z %140 = tt.reshape %139 : tensor<2x128xf32, #blocked6> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1058509Z %141 = arith.truncf %140 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3> 2026-02-21T10:53:44.1058760Z %142 = arith.extf %141 : tensor<1x2x128xbf16, #blocked3> to tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1058963Z %143 = "tt.reduce"(%142) <{axis = 2 : i32}> ({ 2026-02-21T10:53:44.1059095Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T10:53:44.1059225Z %198 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T10:53:44.1059354Z tt.reduce.return %198 : f32 2026-02-21T10:53:44.1059550Z }) : (tensor<1x2x128xf32, #blocked3>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> 2026-02-21T10:53:44.1059851Z %144 = ttg.convert_layout %143 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1060130Z %145 = arith.truncf %144 : tensor<1x2xf32, #blocked5> to tensor<1x2xbf16, #blocked5> 2026-02-21T10:53:44.1060357Z %146 = arith.extf %145 : tensor<1x2xbf16, #blocked5> to tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1060552Z %147 = arith.mulf %146, %cst_12 : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1060750Z %148 = arith.truncf %147 : tensor<1x2xf32, #blocked5> to tensor<1x2xbf16, #blocked5> 2026-02-21T10:53:44.1060971Z %149 = arith.extf %148 : tensor<1x2xbf16, #blocked5> to tensor<1x2xf32, 
#blocked5> 2026-02-21T10:53:44.1061175Z %150 = arith.cmpf ogt, %arg6, %149 : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1061355Z %151 = arith.cmpf une, %arg6, %arg6 : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1061525Z %152 = arith.ori %150, %151 : tensor<1x2xi1, #blocked5> 2026-02-21T10:53:44.1061725Z %153 = arith.select %152, %arg6, %149 : tensor<1x2xi1, #blocked5>, tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1061932Z %154 = arith.mulf %142, %cst_11 : tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1062145Z %155 = arith.truncf %154 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3> 2026-02-21T10:53:44.1062441Z %156 = ttg.convert_layout %153 : tensor<1x2xf32, #blocked5> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> 2026-02-21T10:53:44.1062806Z %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xf32, #blocked11> 2026-02-21T10:53:44.1063115Z %158 = ttg.convert_layout %157 : tensor<1x2x1xf32, #blocked11> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T10:53:44.1063363Z %159 = arith.extf %155 : tensor<1x2x128xbf16, #blocked3> to tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1063610Z %160 = tt.broadcast %158 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4> 2026-02-21T10:53:44.1063879Z %161 = ttg.convert_layout %160 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1064095Z %162 = arith.subf %159, %161 : tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1064412Z %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x128xf32, #blocked3>) -> tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1064704Z %164 = "tt.reduce"(%163) <{axis = 2 : i32}> ({ 2026-02-21T10:53:44.1064841Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T10:53:44.1064963Z %198 = arith.addf %arg9, %arg10 : f32 2026-02-21T10:53:44.1065114Z tt.reduce.return %198 : f32 
2026-02-21T10:53:44.1065310Z }) : (tensor<1x2x128xf32, #blocked3>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> 2026-02-21T10:53:44.1065604Z %165 = ttg.convert_layout %164 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1065853Z %166 = arith.subf %arg6, %153 : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1066146Z %167 = tt.extern_elementwise %166 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked5>) -> tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1066457Z %168 = arith.mulf %arg7, %167 : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1066623Z %169 = arith.addf %168, %165 : tensor<1x2xf32, #blocked5> 2026-02-21T10:53:44.1066871Z %170 = ttg.convert_layout %167 : tensor<1x2xf32, #blocked5> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> 2026-02-21T10:53:44.1067218Z %171 = tt.expand_dims %170 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xf32, #blocked11> 2026-02-21T10:53:44.1067519Z %172 = ttg.convert_layout %171 : tensor<1x2x1xf32, #blocked11> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T10:53:44.1067771Z %173 = tt.broadcast %172 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4> 2026-02-21T10:53:44.1068024Z %174 = ttg.convert_layout %173 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1068242Z %175 = arith.mulf %arg8, %174 : tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1068495Z %176 = ttg.convert_layout %112 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:53:44.1068831Z %177 = tt.expand_dims %176 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8> 2026-02-21T10:53:44.1069132Z %178 = ttg.convert_layout %177 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked6> 2026-02-21T10:53:44.1069430Z %179 = 
ttg.convert_layout %178 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T10:53:44.1069780Z %180 = tt.expand_dims %179 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi32, #blocked10> 2026-02-21T10:53:44.1070097Z %181 = ttg.convert_layout %180 : tensor<1x128x1xi32, #blocked10> -> tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.1070321Z %182 = arith.muli %181, %cst_10 : tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.1070493Z %183 = arith.addi %88, %182 : tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.1070704Z %184 = tt.broadcast %183 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x128xi32, #blocked2> 2026-02-21T10:53:44.1070983Z %185 = ttg.convert_layout %184 : tensor<1x128x128xi32, #blocked2> -> tensor<1x128x128xi32, #blocked> 2026-02-21T10:53:44.1071208Z %186 = arith.addi %185, %35 : tensor<1x128x128xi32, #blocked> 2026-02-21T10:53:44.1071430Z %187 = tt.addptr %36, %186 : tensor<1x128x128x!tt.ptr, #blocked>, tensor<1x128x128xi32, #blocked> 2026-02-21T10:53:44.1071653Z %188 = tt.load %187 : tensor<1x128x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.1071880Z %189 = arith.truncf %163 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3> 2026-02-21T10:53:44.1072124Z %190 = tt.reshape %175 : tensor<1x2x128xf32, #blocked3> -> tensor<2x128xf32, #blocked6> 2026-02-21T10:53:44.1072366Z %191 = tt.reshape %189 : tensor<1x2x128xbf16, #blocked3> -> tensor<2x128xbf16, #blocked6> 2026-02-21T10:53:44.1072614Z %192 = tt.reshape %188 : tensor<1x128x128xbf16, #blocked> -> tensor<128x128xbf16, #blocked6> 2026-02-21T10:53:44.1072921Z %193 = ttg.convert_layout %191 : tensor<2x128xbf16, #blocked6> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> 2026-02-21T10:53:44.1073303Z %194 = ttg.convert_layout %192 : tensor<128x128xbf16, #blocked6> -> tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T10:53:44.1073608Z %195 
= ttg.convert_layout %190 : tensor<2x128xf32, #blocked6> -> tensor<2x128xf32, #blocked6> 2026-02-21T10:53:44.1074023Z %196 = tt.dot %193, %194, %195, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<2x128xf32, #blocked6> 2026-02-21T10:53:44.1074426Z %197 = tt.reshape %196 : tensor<2x128xf32, #blocked6> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1074719Z scf.yield %153, %169, %197 : tensor<1x2xf32, #blocked5>, tensor<1x2xf32, #blocked5>, tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1074933Z } 2026-02-21T10:53:44.1075126Z %90 = ttg.convert_layout %89#1 : tensor<1x2xf32, #blocked5> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> 2026-02-21T10:53:44.1075472Z %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xf32, #blocked11> 2026-02-21T10:53:44.1075772Z %92 = ttg.convert_layout %91 : tensor<1x2x1xf32, #blocked11> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T10:53:44.1076027Z %93 = tt.broadcast %92 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4> 2026-02-21T10:53:44.1076272Z %94 = ttg.convert_layout %93 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1076490Z %95 = arith.divf %89#2, %94 : tensor<1x2x128xf32, #blocked3> 2026-02-21T10:53:44.1076694Z %96 = arith.truncf %95 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3> 2026-02-21T10:53:44.1076885Z %97 = arith.muli %46, %c65536_i32 : i32 2026-02-21T10:53:44.1077102Z %98 = ttg.convert_layout %50 : tensor<2xi32, #blocked7> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T10:53:44.1077422Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x2xi32, #blocked8> 2026-02-21T10:53:44.1077710Z %100 = ttg.convert_layout %99 : tensor<1x2xi32, #blocked8> -> 
tensor<1x2xi32, #blocked5> 2026-02-21T10:53:44.1077996Z %101 = ttg.convert_layout %100 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked11}>> 2026-02-21T10:53:44.1078337Z %102 = tt.expand_dims %101 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xi32, #blocked11> 2026-02-21T10:53:44.1078637Z %103 = ttg.convert_layout %102 : tensor<1x2x1xi32, #blocked11> -> tensor<1x2x1xi32, #blocked4> 2026-02-21T10:53:44.1078857Z %104 = arith.muli %103, %cst_9 : tensor<1x2x1xi32, #blocked4> 2026-02-21T10:53:44.1079045Z %105 = tt.splat %97 : i32 -> tensor<1x2x1xi32, #blocked4> 2026-02-21T10:53:44.1079206Z %106 = arith.addi %105, %104 : tensor<1x2x1xi32, #blocked4> 2026-02-21T10:53:44.1079409Z %107 = tt.broadcast %106 : tensor<1x2x1xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4> 2026-02-21T10:53:44.1079657Z %108 = ttg.convert_layout %107 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked3> 2026-02-21T10:53:44.1079892Z %109 = arith.addi %108, %44 : tensor<1x2x128xi32, #blocked3> 2026-02-21T10:53:44.1080107Z %110 = tt.addptr %45, %109 : tensor<1x2x128x!tt.ptr, #blocked3>, tensor<1x2x128xi32, #blocked3> 2026-02-21T10:53:44.1080325Z tt.store %110, %96 : tensor<1x2x128x!tt.ptr, #blocked3> 2026-02-21T10:53:44.1080477Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T10:53:44.1080591Z tt.return 2026-02-21T10:53:44.1080678Z } 2026-02-21T10:53:44.1080757Z } 2026-02-21T10:53:44.1080804Z 2026-02-21T10:53:44.1080841Z {-# 2026-02-21T10:53:44.1080922Z external_resources: { 2026-02-21T10:53:44.1081029Z mlir_reproducer: { 2026-02-21T10:53:44.1083433Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T10:53:44.1085725Z disable_threading: false, 2026-02-21T10:53:44.1085840Z verify_each: true 2026-02-21T10:53:44.1085932Z } 2026-02-21T10:53:44.1086015Z } 2026-02-21T10:53:44.1086087Z #-} 2026-02-21T10:53:44.1086367Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:53:44.1087076Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:53:44.1087633Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T10:53:44.1088449Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 0], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T10:53:44.1089185Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:53:44.1089356Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T10:53:44.8628517Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:55:129: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T10:53:44.8629765Z k = tl.load(k_view + (indices_0[:, None, None] * 65536 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T10:53:44.8630394Z ^ 2026-02-21T10:53:44.8632047Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:57:141: note: - use: %132 = "tt.reshape"(<>) : (tensor<1x128x64xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x64xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>> 2026-02-21T10:53:44.8633443Z 2026-02-21T10:53:44.8634252Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T10:53:44.8635399Z ^ 2026-02-21T10:53:44.8635800Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T10:53:44.8636309Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 
1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:53:44.8636956Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:53:44.8637652Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T10:53:44.8638279Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:53:44.8638882Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:53:44.8639457Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T10:53:44.8640032Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T10:53:44.8640652Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:53:44.8641293Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:53:44.8641929Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T10:53:44.8642549Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T10:53:44.8643327Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T10:53:44.8644373Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = 
false} { 2026-02-21T10:53:44.8645201Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T10:53:44.8645460Z %c192_i64 = arith.constant 192 : i64 2026-02-21T10:53:44.8645678Z %c0_i64 = arith.constant 0 : i64 2026-02-21T10:53:44.8645855Z %c65536_i64 = arith.constant 65536 : i64 2026-02-21T10:53:44.8646090Z %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked> 2026-02-21T10:53:44.8646380Z %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked> 2026-02-21T10:53:44.8646672Z %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked> 2026-02-21T10:53:44.8646925Z %cst_2 = arith.constant dense<512> : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:53:44.8647181Z %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:53:44.8647430Z %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:53:44.8647640Z %c64_i32 = arith.constant 64 : i32 2026-02-21T10:53:44.8647834Z %c512_i32 = arith.constant 512 : i32 2026-02-21T10:53:44.8648003Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T10:53:44.8648210Z %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1> 2026-02-21T10:53:44.8648462Z %cst_6 = arith.constant dense<128> : tensor<1x64x1xi32, #blocked2> 2026-02-21T10:53:44.8648734Z %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8649012Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8649289Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x64xf32, #blocked4> 2026-02-21T10:53:44.8649558Z %cst_10 = arith.constant dense<128> : tensor<1x1x64xi32, #blocked> 2026-02-21T10:53:44.8649797Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:53:44.8650017Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked> 2026-02-21T10:53:44.8650298Z %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8650576Z %cst_13 = arith.constant dense<0xFF800000> : 
tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8650799Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:53:44.8650959Z %c16_i32 = arith.constant 16 : i32 2026-02-21T10:53:44.8651123Z %c256_i32 = arith.constant 256 : i32 2026-02-21T10:53:44.8651291Z %0 = tt.get_program_id x : i32 2026-02-21T10:53:44.8651483Z %1 = arith.divsi %0, %c3072_i32 : i32 2026-02-21T10:53:44.8651646Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T10:53:44.8651810Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T10:53:44.8651969Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T10:53:44.8652131Z %5 = arith.remsi %0, %c3072_i32 : i32 2026-02-21T10:53:44.8652302Z %6 = arith.remsi %5, %4 : i32 2026-02-21T10:53:44.8652455Z %7 = arith.addi %2, %6 : i32 2026-02-21T10:53:44.8652607Z %8 = arith.divsi %5, %4 : i32 2026-02-21T10:53:44.8652759Z %9 = arith.muli %7, %c2_i32 : i32 2026-02-21T10:53:44.8652979Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5> 2026-02-21T10:53:44.8653236Z %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5> 2026-02-21T10:53:44.8653449Z %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5> 2026-02-21T10:53:44.8653698Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5> 2026-02-21T10:53:44.8653936Z %14 = arith.extsi %8 : i32 to i64 2026-02-21T10:53:44.8654099Z %15 = arith.extsi %9 : i32 to i64 2026-02-21T10:53:44.8654320Z %16 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.8654559Z %17 = arith.muli %14, %c65536_i64 : i64 2026-02-21T10:53:44.8654754Z %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked> 2026-02-21T10:53:44.8654973Z %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5> 2026-02-21T10:53:44.8655216Z %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5> 2026-02-21T10:53:44.8655464Z %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5> 2026-02-21T10:53:44.8655774Z %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 
0, parent = #blocked6}>> 2026-02-21T10:53:44.8656136Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6> 2026-02-21T10:53:44.8656444Z %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3> 2026-02-21T10:53:44.8656751Z %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:53:44.8657137Z %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7> 2026-02-21T10:53:44.8657457Z %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1> 2026-02-21T10:53:44.8657685Z %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:53:44.8657899Z %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1> 2026-02-21T10:53:44.8658182Z %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked> 2026-02-21T10:53:44.8658441Z %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5> 2026-02-21T10:53:44.8658735Z %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:53:44.8659096Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6> 2026-02-21T10:53:44.8659433Z %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4> 2026-02-21T10:53:44.8659750Z %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T10:53:44.8660125Z %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8> 2026-02-21T10:53:44.8660458Z %37 = ttg.convert_layout %36 : 
tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked> 2026-02-21T10:53:44.8660726Z %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked> 2026-02-21T10:53:44.8660963Z %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked> 2026-02-21T10:53:44.8661140Z %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked> 2026-02-21T10:53:44.8661367Z %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi64, #blocked> 2026-02-21T10:53:44.8661581Z %42 = arith.cmpi sge, %14, %c0_i64 : i64 2026-02-21T10:53:44.8661723Z %43 = arith.cmpi slt, %14, %c192_i64 : i64 2026-02-21T10:53:44.8661856Z %44 = arith.andi %42, %43 : i1 2026-02-21T10:53:44.8662013Z %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:53:44.8662205Z %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1> 2026-02-21T10:53:44.8662386Z %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1> 2026-02-21T10:53:44.8662556Z %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1> 2026-02-21T10:53:44.8662720Z %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1> 2026-02-21T10:53:44.8662926Z %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1> 2026-02-21T10:53:44.8663189Z %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked> 2026-02-21T10:53:44.8663428Z %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked> 2026-02-21T10:53:44.8663619Z %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked> 2026-02-21T10:53:44.8663798Z %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked> 2026-02-21T10:53:44.8664004Z %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T10:53:44.8664211Z %56 = arith.andi %51, %55 : tensor<1x2x128xi1, #blocked> 2026-02-21T10:53:44.8664393Z %57 = tt.load %41, %56, %cst : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.8664602Z %58 = tt.make_range {end = 64 : i32, start 
= 0 : i32} : tensor<64xi32, #blocked5> 2026-02-21T10:53:44.8664788Z %59 = arith.muli %8, %c65536_i32 : i32 2026-02-21T10:53:44.8665032Z %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:53:44.8665396Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T10:53:44.8665718Z %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4> 2026-02-21T10:53:44.8665997Z %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T10:53:44.8666335Z %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9> 2026-02-21T10:53:44.8666658Z %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.8666861Z %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.8667016Z %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2> 2026-02-21T10:53:44.8667207Z %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x64xi32, #blocked2> 2026-02-21T10:53:44.8667448Z %69 = ttg.convert_layout %68 : tensor<1x128x64xi32, #blocked2> -> tensor<1x128x64xi32, #blocked> 2026-02-21T10:53:44.8667680Z %70 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x64x!tt.ptr, #blocked> 2026-02-21T10:53:44.8667917Z %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4> 2026-02-21T10:53:44.8668108Z %72 = tt.splat %59 : i32 -> tensor<1x64x1xi32, #blocked2> 2026-02-21T10:53:44.8668342Z %73 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T10:53:44.8668674Z %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> 
tensor<1x1x128xi32, #blocked8> 2026-02-21T10:53:44.8668967Z %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T10:53:44.8669220Z %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x64x128xi32, #blocked> 2026-02-21T10:53:44.8669440Z %77 = tt.splat %arg2 : !tt.ptr -> tensor<1x64x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.8669816Z %78:3 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c64_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>) : i32 { 2026-02-21T10:53:44.8670165Z %108 = tt.splat %arg4 : i32 -> tensor<64xi32, #blocked5> 2026-02-21T10:53:44.8670324Z %109 = arith.addi %108, %58 : tensor<64xi32, #blocked5> 2026-02-21T10:53:44.8670557Z %110 = ttg.convert_layout %109 : tensor<64xi32, #blocked5> -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:53:44.8670884Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x64xi32, #blocked6> 2026-02-21T10:53:44.8671177Z %112 = ttg.convert_layout %111 : tensor<1x64xi32, #blocked6> -> tensor<1x64xi32, #blocked4> 2026-02-21T10:53:44.8671458Z %113 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T10:53:44.8671797Z %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x64xi32, #blocked8> 2026-02-21T10:53:44.8672095Z %115 = ttg.convert_layout %114 : tensor<1x1x64xi32, #blocked8> -> tensor<1x1x64xi32, #blocked> 2026-02-21T10:53:44.8672309Z %116 = arith.muli %115, %cst_10 : tensor<1x1x64xi32, #blocked> 2026-02-21T10:53:44.8672510Z %117 = tt.broadcast %116 : tensor<1x1x64xi32, #blocked> -> tensor<1x128x64xi32, #blocked> 2026-02-21T10:53:44.8672708Z %118 = arith.addi %69, %117 : tensor<1x128x64xi32, #blocked> 2026-02-21T10:53:44.8672922Z 
%119 = tt.addptr %70, %118 : tensor<1x128x64x!tt.ptr, #blocked>, tensor<1x128x64xi32, #blocked> 2026-02-21T10:53:44.8673134Z %120 = tt.load %119 : tensor<1x128x64x!tt.ptr, #blocked> 2026-02-21T10:53:44.8673337Z %121 = tt.reshape %120 : tensor<1x128x64xbf16, #blocked> -> tensor<128x64xbf16, #blocked4> 2026-02-21T10:53:44.8673645Z %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>> 2026-02-21T10:53:44.8673996Z %123 = ttg.convert_layout %121 : tensor<128x64xbf16, #blocked4> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>> 2026-02-21T10:53:44.8674294Z %124 = ttg.convert_layout %cst_9 : tensor<2x64xf32, #blocked4> -> tensor<2x64xf32, #blocked4> 2026-02-21T10:53:44.8674711Z %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>> * tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>> -> tensor<2x64xf32, #blocked4> 2026-02-21T10:53:44.8675094Z %126 = tt.reshape %125 : tensor<2x64xf32, #blocked4> -> tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8675325Z %127 = arith.truncf %126 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked> 2026-02-21T10:53:44.8675555Z %128 = arith.extf %127 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8675740Z %129 = "tt.reduce"(%128) <{axis = 2 : i32}> ({ 2026-02-21T10:53:44.8675880Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:53:44.8676004Z %182 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T10:53:44.8676124Z tt.reduce.return %182 : f32 2026-02-21T10:53:44.8676307Z }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T10:53:44.8676593Z %130 = ttg.convert_layout %129 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8676860Z %131 = arith.truncf %130 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 
2026-02-21T10:53:44.8677094Z %132 = arith.extf %131 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8677285Z %133 = arith.mulf %132, %cst_8 : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8677475Z %134 = arith.truncf %133 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 2026-02-21T10:53:44.8677691Z %135 = arith.extf %134 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8677885Z %136 = arith.cmpf ogt, %arg5, %135 : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8678057Z %137 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8678218Z %138 = arith.ori %136, %137 : tensor<1x2xi1, #blocked3> 2026-02-21T10:53:44.8678413Z %139 = arith.select %138, %arg5, %135 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8678619Z %140 = arith.mulf %128, %cst_7 : tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8678815Z %141 = arith.truncf %140 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked> 2026-02-21T10:53:44.8679098Z %142 = ttg.convert_layout %139 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:53:44.8679430Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T10:53:44.8679724Z %144 = ttg.convert_layout %143 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T10:53:44.8679960Z %145 = arith.extf %141 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8680188Z %146 = tt.broadcast %144 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x64xf32, #blocked1> 2026-02-21T10:53:44.8680427Z %147 = ttg.convert_layout %146 : tensor<1x2x64xf32, #blocked1> -> tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8680633Z %148 = arith.subf %145, %147 : tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8680928Z %149 = tt.extern_elementwise %148 {libname = "", 
libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2x64xf32, #blocked> 2026-02-21T10:53:44.8681227Z %150 = "tt.reduce"(%149) <{axis = 2 : i32}> ({ 2026-02-21T10:53:44.8681351Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T10:53:44.8681468Z %182 = arith.addf %arg8, %arg9 : f32 2026-02-21T10:53:44.8681585Z tt.reduce.return %182 : f32 2026-02-21T10:53:44.8681765Z }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T10:53:44.8682045Z %151 = ttg.convert_layout %150 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8682301Z %152 = arith.subf %arg5, %139 : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8682634Z %153 = tt.extern_elementwise %152 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8682919Z %154 = arith.mulf %arg6, %153 : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8683077Z %155 = arith.addf %154, %151 : tensor<1x2xf32, #blocked3> 2026-02-21T10:53:44.8683315Z %156 = ttg.convert_layout %153 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:53:44.8683665Z %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T10:53:44.8683958Z %158 = ttg.convert_layout %157 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T10:53:44.8684198Z %159 = tt.broadcast %158 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T10:53:44.8684440Z %160 = ttg.convert_layout %159 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T10:53:44.8684651Z %161 = arith.mulf %arg7, %160 : tensor<1x2x128xf32, #blocked> 2026-02-21T10:53:44.8684918Z %162 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, 
#ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T10:53:44.8685256Z %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x64x1xi32, #blocked9> 2026-02-21T10:53:44.8685554Z %164 = ttg.convert_layout %163 : tensor<1x64x1xi32, #blocked9> -> tensor<1x64x1xi32, #blocked2> 2026-02-21T10:53:44.8685766Z %165 = arith.muli %164, %cst_6 : tensor<1x64x1xi32, #blocked2> 2026-02-21T10:53:44.8685927Z %166 = arith.addi %72, %165 : tensor<1x64x1xi32, #blocked2> 2026-02-21T10:53:44.8686126Z %167 = tt.broadcast %166 : tensor<1x64x1xi32, #blocked2> -> tensor<1x64x128xi32, #blocked2> 2026-02-21T10:53:44.8686383Z %168 = ttg.convert_layout %167 : tensor<1x64x128xi32, #blocked2> -> tensor<1x64x128xi32, #blocked> 2026-02-21T10:53:44.8686591Z %169 = arith.addi %168, %76 : tensor<1x64x128xi32, #blocked> 2026-02-21T10:53:44.8686808Z %170 = tt.addptr %77, %169 : tensor<1x64x128x!tt.ptr, #blocked>, tensor<1x64x128xi32, #blocked> 2026-02-21T10:53:44.8687019Z %171 = tt.load %170 : tensor<1x64x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.8687219Z %172 = arith.truncf %149 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked> 2026-02-21T10:53:44.8687450Z %173 = tt.reshape %161 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4> 2026-02-21T10:53:44.8687674Z %174 = tt.reshape %172 : tensor<1x2x64xbf16, #blocked> -> tensor<2x64xbf16, #blocked4> 2026-02-21T10:53:44.8687903Z %175 = tt.reshape %171 : tensor<1x64x128xbf16, #blocked> -> tensor<64x128xbf16, #blocked4> 2026-02-21T10:53:44.8688195Z %176 = ttg.convert_layout %174 : tensor<2x64xbf16, #blocked4> -> tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> 2026-02-21T10:53:44.8688548Z %177 = ttg.convert_layout %175 : tensor<64x128xbf16, #blocked4> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> 2026-02-21T10:53:44.8688848Z %178 = ttg.convert_layout %173 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked10> 
2026-02-21T10:53:44.8689254Z %179 = tt.dot %176, %177, %178, inputPrecision = tf32 : tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10> 2026-02-21T10:53:44.8689677Z %180 = ttg.convert_layout %179 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked4> 2026-02-21T10:53:44.8689909Z %181 = tt.reshape %180 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked> 2026-02-21T10:53:44.8690170Z scf.yield %139, %155, %181 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked> 2026-02-21T10:53:44.8690436Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T10:53:44.8690692Z %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:53:44.8691017Z %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T10:53:44.8691305Z %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T10:53:44.8691552Z %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T10:53:44.8691788Z %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T10:53:44.8691991Z %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked> 2026-02-21T10:53:44.8692184Z %85 = arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked> 2026-02-21T10:53:44.8692358Z %86 = arith.muli %8, %c65536_i32 : i32 2026-02-21T10:53:44.8692568Z %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:53:44.8692895Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6> 
2026-02-21T10:53:44.8693166Z %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3> 2026-02-21T10:53:44.8693451Z %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T10:53:44.8693785Z %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7> 2026-02-21T10:53:44.8694070Z %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T10:53:44.8694274Z %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1> 2026-02-21T10:53:44.8694429Z %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1> 2026-02-21T10:53:44.8694579Z %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1> 2026-02-21T10:53:44.8694810Z %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T10:53:44.8695128Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T10:53:44.8695410Z %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4> 2026-02-21T10:53:44.8695688Z %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T10:53:44.8696021Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8> 2026-02-21T10:53:44.8703777Z %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T10:53:44.8704026Z %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1> 2026-02-21T10:53:44.8704269Z %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked> 2026-02-21T10:53:44.8704511Z %104 = tt.broadcast %101 : 
tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked> 2026-02-21T10:53:44.8704756Z %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked> 2026-02-21T10:53:44.8704944Z %106 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.8705178Z %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi32, #blocked> 2026-02-21T10:53:44.8705388Z tt.store %107, %85 : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T10:53:44.8705542Z tt.return 2026-02-21T10:53:44.8705620Z } 2026-02-21T10:53:44.8705693Z } 2026-02-21T10:53:44.8705735Z 2026-02-21T10:53:44.8705765Z {-# 2026-02-21T10:53:44.8705847Z external_resources: { 2026-02-21T10:53:44.8705944Z mlir_reproducer: { 2026-02-21T10:53:44.8708200Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T10:53:44.8710458Z disable_threading: false, 2026-02-21T10:53:44.8710564Z verify_each: true 2026-02-21T10:53:44.8710655Z } 2026-02-21T10:53:44.8710727Z } 2026-02-21T10:53:44.8710794Z #-} 2026-02-21T10:53:44.8711073Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:53:44.8711752Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:53:44.8712298Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T10:53:44.8713019Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T10:53:44.8713682Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:53:44.8713847Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T10:53:53.7777171Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.8 configs/s 2026-02-21T10:53:53.7787270Z [51s] Adaptive compile timeout: 30s (90% percentile=12.2s, bounds=[30.0s, 30s]) 2026-02-21T10:53:54.9898733Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 950/950 564.1 configs/s 2026-02-21T10:53:56.1324351Z [54s] Initial random population of 100, 5 starting points: 2026-02-21T10:53:56.1324849Z error=15 2026-02-21T10:53:56.1325532Z timeout=1 2026-02-21T10:53:56.1325731Z ok=84 2026-02-21T10:53:56.1325933Z min=0.2113 2026-02-21T10:53:56.1326135Z mid=1.1127 2026-02-21T10:53:56.1326338Z max=198.0058 2026-02-21T10:53:56.1326601Z best={'block_sizes': [1, 128, 16], 2026-02-21T10:53:56.1326986Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:53:56.1327366Z 'l2_groupings': [64], 2026-02-21T10:53:56.1327635Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:53:56.1328110Z 'loop_orders': [[1, 0]], 2026-02-21T10:53:56.1328384Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:53:56.1328651Z 'num_stages': 2, 2026-02-21T10:53:56.1328880Z 'num_warps': 4, 2026-02-21T10:53:56.1329114Z 'pid_type': 'flat', 2026-02-21T10:53:56.1329371Z 'range_flattens': [None, None], 2026-02-21T10:53:56.1329678Z 'range_multi_buffers': [None, False], 2026-02-21T10:53:56.1329985Z 'range_num_stages': [0, 2], 2026-02-21T10:53:56.1330266Z 'range_unroll_factors': [0, 2], 2026-02-21T10:53:56.1330563Z 'range_warp_specializes': [], 2026-02-21T10:53:56.1330843Z 'waves_per_eu': 3} 2026-02-21T10:53:56.1412265Z [54s] Fitting surrogate: 100 points, 100 targets 2026-02-21T10:53:57.6060384Z [55s] Generation 1 starting: 80 neighbors, 5 active search path(s) 2026-02-21T10:54:13.6395690Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 2.0 configs/s 2026-02-21T10:54:18.4118388Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 17.5 configs/s 2026-02-21T10:54:22.3176041Z Generation 1: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 229.1 2026-02-21T10:54:22.3176549Z configs/s 2026-02-21T10:54:23.2246921Z [81s] Generation 1 complete: 2026-02-21T10:54:23.2247235Z error=5 2026-02-21T10:54:23.2247414Z ok=80 2026-02-21T10:54:23.2247550Z min=0.1385 2026-02-21T10:54:23.2247690Z mid=0.2859 2026-02-21T10:54:23.2248350Z max=2.8840 2026-02-21T10:54:23.2248504Z best={'block_sizes': [1, 32, 32], 2026-02-21T10:54:23.2248773Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T10:54:23.2249062Z 'l2_groupings': [32], 2026-02-21T10:54:23.2249311Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:54:23.2249614Z 'loop_orders': [[0, 1]], 2026-02-21T10:54:23.2249810Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:54:23.2249997Z 'num_sm_multiplier': 16, 2026-02-21T10:54:23.2250173Z 'num_stages': 1, 2026-02-21T10:54:23.2250314Z 'num_warps': 2, 2026-02-21T10:54:23.2250475Z 'pid_type': 'persistent_blocked', 2026-02-21T10:54:23.2250674Z 'range_flattens': [False, None], 2026-02-21T10:54:23.2250868Z 'range_multi_buffers': [False, None], 2026-02-21T10:54:23.2251058Z 'range_num_stages': [1, 1], 2026-02-21T10:54:23.2251228Z 'range_unroll_factors': [1, 1], 2026-02-21T10:54:23.2251408Z 'range_warp_specializes': [], 2026-02-21T10:54:23.2251579Z 'waves_per_eu': 3} 2026-02-21T10:54:23.2562381Z [81s] Fitting surrogate: 185 points, 185 targets 2026-02-21T10:54:24.7520145Z [82s] Generation 2 starting: 82 neighbors, 5 active search path(s) 2026-02-21T10:54:47.5777593Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 1.2 configs/s 2026-02-21T10:54:52.5185359Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 17.1 configs/s 2026-02-21T10:55:00.8856978Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 114.0 2026-02-21T10:55:00.8859094Z configs/s 2026-02-21T10:55:01.9512934Z [119s] Generation 2 complete: 2026-02-21T10:55:01.9513743Z error=3 2026-02-21T10:55:01.9513950Z ok=84 2026-02-21T10:55:01.9514156Z min=0.1375 
2026-02-21T10:55:01.9514370Z mid=0.2113 2026-02-21T10:55:01.9514567Z max=1.7679 2026-02-21T10:55:01.9514797Z best={'block_sizes': [1, 32, 32], 2026-02-21T10:55:01.9517042Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T10:55:01.9517480Z 'l2_groupings': [32], 2026-02-21T10:55:01.9517752Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:55:01.9518058Z 'loop_orders': [[0, 1]], 2026-02-21T10:55:01.9518508Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:55:01.9518774Z 'num_sm_multiplier': 16, 2026-02-21T10:55:01.9519025Z 'num_stages': 2, 2026-02-21T10:55:01.9519240Z 'num_warps': 2, 2026-02-21T10:55:01.9519496Z 'pid_type': 'persistent_blocked', 2026-02-21T10:55:01.9519789Z 'range_flattens': [False, None], 2026-02-21T10:55:01.9520078Z 'range_multi_buffers': [False, False], 2026-02-21T10:55:01.9520367Z 'range_num_stages': [1, 1], 2026-02-21T10:55:01.9520636Z 'range_unroll_factors': [1, 1], 2026-02-21T10:55:01.9520914Z 'range_warp_specializes': [], 2026-02-21T10:55:01.9521172Z 'waves_per_eu': 3} 2026-02-21T10:55:02.0301545Z [120s] Fitting surrogate: 272 points, 272 targets 2026-02-21T10:55:03.4968133Z [121s] Generation 3 starting: 76 neighbors, 5 active search path(s) 2026-02-21T10:55:18.3028075Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 5.9 configs/s 2026-02-21T10:55:23.0221876Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 16.8 configs/s 2026-02-21T10:55:29.7796487Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 139.8 2026-02-21T10:55:29.7797287Z configs/s 2026-02-21T10:55:30.7754550Z [148s] Generation 3 complete: 2026-02-21T10:55:30.7754974Z error=2 2026-02-21T10:55:30.7755186Z ok=79 2026-02-21T10:55:30.7755390Z min=0.1389 2026-02-21T10:55:30.7755592Z mid=0.2425 2026-02-21T10:55:30.7755822Z max=2.3012 2026-02-21T10:55:30.7756096Z best={'block_sizes': [1, 32, 32], 2026-02-21T10:55:30.7756506Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 
2026-02-21T10:55:30.7756918Z 'l2_groupings': [32], 2026-02-21T10:55:30.7757203Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:55:30.7757516Z 'loop_orders': [[0, 1]], 2026-02-21T10:55:30.7757794Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:55:30.7758648Z 'num_sm_multiplier': 16, 2026-02-21T10:55:30.7758911Z 'num_stages': 2, 2026-02-21T10:55:30.7759137Z 'num_warps': 2, 2026-02-21T10:55:30.7759415Z 'pid_type': 'persistent_blocked', 2026-02-21T10:55:30.7759724Z 'range_flattens': [False, None], 2026-02-21T10:55:30.7760037Z 'range_multi_buffers': [False, False], 2026-02-21T10:55:30.7760353Z 'range_num_stages': [1, 1], 2026-02-21T10:55:30.7760637Z 'range_unroll_factors': [1, 1], 2026-02-21T10:55:30.7760928Z 'range_warp_specializes': [], 2026-02-21T10:55:30.7761216Z 'waves_per_eu': 3} 2026-02-21T10:55:30.8424205Z [148s] Fitting surrogate: 353 points, 353 targets 2026-02-21T10:55:32.2848976Z [150s] Generation 4 starting: 72 neighbors, 5 active search path(s) 2026-02-21T10:55:53.5839298Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.7 configs/s 2026-02-21T10:55:57.9652762Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 16.8 configs/s 2026-02-21T10:56:05.9377987Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 119.6 2026-02-21T10:56:05.9379685Z configs/s 2026-02-21T10:56:06.9468399Z [184s] Generation 4 complete: 2026-02-21T10:56:06.9468838Z error=3 2026-02-21T10:56:06.9469041Z ok=74 2026-02-21T10:56:06.9469243Z min=0.1404 2026-02-21T10:56:06.9469443Z mid=0.2116 2026-02-21T10:56:06.9469697Z max=1.0073 2026-02-21T10:56:06.9469928Z best={'block_sizes': [1, 32, 32], 2026-02-21T10:56:06.9470340Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T10:56:06.9470751Z 'l2_groupings': [32], 2026-02-21T10:56:06.9471484Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:56:06.9471802Z 'loop_orders': [[0, 1]], 2026-02-21T10:56:06.9472083Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:56:06.9472367Z 
'num_sm_multiplier': 16, 2026-02-21T10:56:06.9472621Z 'num_stages': 2, 2026-02-21T10:56:06.9472852Z 'num_warps': 2, 2026-02-21T10:56:06.9473108Z 'pid_type': 'persistent_blocked', 2026-02-21T10:56:06.9473428Z 'range_flattens': [None, None], 2026-02-21T10:56:06.9473729Z 'range_multi_buffers': [False, False], 2026-02-21T10:56:06.9474212Z 'range_num_stages': [1, 1], 2026-02-21T10:56:06.9474487Z 'range_unroll_factors': [1, 0], 2026-02-21T10:56:06.9474774Z 'range_warp_specializes': [], 2026-02-21T10:56:06.9475048Z 'waves_per_eu': 3} 2026-02-21T10:56:06.9523434Z [184s] Fitting surrogate: 430 points, 430 targets 2026-02-21T10:56:08.4891290Z [186s] Generation 5 starting: 78 neighbors, 5 active search path(s) 2026-02-21T10:56:19.1979255Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 3.5 configs/s 2026-02-21T10:56:24.1400736Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.5 configs/s 2026-02-21T10:56:35.7480274Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 88.7 configs/s 2026-02-21T10:56:36.8165697Z [214s] Generation 5 complete: 2026-02-21T10:56:36.8165912Z ok=83 2026-02-21T10:56:36.8166104Z min=0.1409 2026-02-21T10:56:36.8166183Z mid=0.1550 2026-02-21T10:56:36.8166371Z max=0.8546 2026-02-21T10:56:36.8166475Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:56:36.8166759Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T10:56:36.8166967Z 'l2_groupings': [1], 2026-02-21T10:56:36.8167086Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:56:36.8167629Z 'loop_orders': [[0, 1]], 2026-02-21T10:56:36.8167762Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:56:36.8167866Z 'num_stages': 3, 2026-02-21T10:56:36.8167950Z 'num_warps': 2, 2026-02-21T10:56:36.8168038Z 'pid_type': 'flat', 2026-02-21T10:56:36.8168135Z 'range_flattens': [None, None], 2026-02-21T10:56:36.8168255Z 'range_multi_buffers': [None, None], 2026-02-21T10:56:36.8168368Z 'range_num_stages': [0, 1], 2026-02-21T10:56:36.8168473Z 
'range_unroll_factors': [0, 4], 2026-02-21T10:56:36.8168582Z 'range_warp_specializes': [], 2026-02-21T10:56:36.8168683Z 'waves_per_eu': 2} 2026-02-21T10:56:36.8228051Z [214s] Fitting surrogate: 513 points, 513 targets 2026-02-21T10:56:37.7042749Z [215s] Generation 6 starting: 78 neighbors, 5 active search path(s) 2026-02-21T10:56:48.2146882Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 5.2 configs/s 2026-02-21T10:56:52.9952495Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.8 configs/s 2026-02-21T10:57:00.6670952Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 123.8 2026-02-21T10:57:00.6671564Z configs/s 2026-02-21T10:57:01.6273414Z [239s] Generation 6 complete: 2026-02-21T10:57:01.6273583Z error=1 2026-02-21T10:57:01.6273673Z ok=82 2026-02-21T10:57:01.6273751Z min=0.1435 2026-02-21T10:57:01.6273836Z mid=0.2180 2026-02-21T10:57:01.6273910Z max=1.2749 2026-02-21T10:57:01.6273999Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:57:01.6274150Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T10:57:01.6274304Z 'l2_groupings': [1], 2026-02-21T10:57:01.6274414Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:57:01.6274533Z 'loop_orders': [[0, 1]], 2026-02-21T10:57:01.6274643Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:57:01.6275089Z 'num_stages': 3, 2026-02-21T10:57:01.6275176Z 'num_warps': 2, 2026-02-21T10:57:01.6275261Z 'pid_type': 'flat', 2026-02-21T10:57:01.6275359Z 'range_flattens': [None, None], 2026-02-21T10:57:01.6275480Z 'range_multi_buffers': [None, None], 2026-02-21T10:57:01.6275595Z 'range_num_stages': [0, 1], 2026-02-21T10:57:01.6275695Z 'range_unroll_factors': [0, 4], 2026-02-21T10:57:01.6275807Z 'range_warp_specializes': [], 2026-02-21T10:57:01.6275991Z 'waves_per_eu': 2} 2026-02-21T10:57:01.6990391Z [239s] Fitting surrogate: 596 points, 596 targets 2026-02-21T10:57:02.3741448Z [240s] Generation 7 starting: 56 neighbors, 4 active search path(s) 
2026-02-21T10:57:16.4482098Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 1.7 configs/s 2026-02-21T10:57:19.9949921Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 16.8 configs/s 2026-02-21T10:57:25.1609815Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 179.7 2026-02-21T10:57:25.1610460Z configs/s 2026-02-21T10:57:26.0538872Z [264s] Generation 7 complete: 2026-02-21T10:57:26.0539251Z ok=60 2026-02-21T10:57:26.0539922Z min=0.1418 2026-02-21T10:57:26.0540139Z mid=0.2171 2026-02-21T10:57:26.0540344Z max=1.1309 2026-02-21T10:57:26.0540583Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:57:26.0540998Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T10:57:26.0541421Z 'l2_groupings': [1], 2026-02-21T10:57:26.0541703Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:57:26.0542021Z 'loop_orders': [[0, 1]], 2026-02-21T10:57:26.0542305Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:57:26.0542572Z 'num_stages': 3, 2026-02-21T10:57:26.0542803Z 'num_warps': 2, 2026-02-21T10:57:26.0543040Z 'pid_type': 'flat', 2026-02-21T10:57:26.0543465Z 'range_flattens': [None, None], 2026-02-21T10:57:26.0543773Z 'range_multi_buffers': [None, False], 2026-02-21T10:57:26.0544085Z 'range_num_stages': [0, 1], 2026-02-21T10:57:26.0544385Z 'range_unroll_factors': [0, 4], 2026-02-21T10:57:26.0544682Z 'range_warp_specializes': [], 2026-02-21T10:57:26.0544959Z 'waves_per_eu': 2} 2026-02-21T10:57:26.1035053Z [264s] Fitting surrogate: 656 points, 656 targets 2026-02-21T10:57:26.7988658Z [264s] Generation 8 starting: 58 neighbors, 4 active search path(s) 2026-02-21T10:57:37.5316202Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 6.7 configs/s 2026-02-21T10:57:41.8531113Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 14.1 configs/s 2026-02-21T10:57:48.9309989Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 133.5 2026-02-21T10:57:48.9310557Z configs/s 
2026-02-21T10:57:49.8040177Z [287s] Generation 8 complete: 2026-02-21T10:57:49.8040591Z ok=62 2026-02-21T10:57:49.8040844Z min=0.1401 2026-02-21T10:57:49.8041057Z mid=0.1890 2026-02-21T10:57:49.8041254Z max=0.8055 2026-02-21T10:57:49.8041504Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:57:49.8041911Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:57:49.8042316Z 'l2_groupings': [64], 2026-02-21T10:57:49.8042697Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:57:49.8043011Z 'loop_orders': [[0, 1]], 2026-02-21T10:57:49.8043286Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:57:49.8043552Z 'num_stages': 2, 2026-02-21T10:57:49.8043783Z 'num_warps': 2, 2026-02-21T10:57:49.8044026Z 'pid_type': 'flat', 2026-02-21T10:57:49.8044291Z 'range_flattens': [None, None], 2026-02-21T10:57:49.8044589Z 'range_multi_buffers': [None, True], 2026-02-21T10:57:49.8044896Z 'range_num_stages': [0, 3], 2026-02-21T10:57:49.8045170Z 'range_unroll_factors': [0, 3], 2026-02-21T10:57:49.8045467Z 'range_warp_specializes': [], 2026-02-21T10:57:49.8045744Z 'waves_per_eu': 2} 2026-02-21T10:57:49.8684473Z [287s] Fitting surrogate: 718 points, 718 targets 2026-02-21T10:57:50.5844213Z [288s] Generation 9 starting: 57 neighbors, 4 active search path(s) 2026-02-21T10:58:02.8387535Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 2.6 configs/s 2026-02-21T10:58:06.4366982Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 16.5 configs/s 2026-02-21T10:58:12.7315455Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 149.8 2026-02-21T10:58:12.7316084Z configs/s 2026-02-21T10:58:13.5868498Z [311s] Generation 9 complete: 2026-02-21T10:58:13.5868870Z ok=61 2026-02-21T10:58:13.5869077Z min=0.1381 2026-02-21T10:58:13.5869290Z mid=0.1736 2026-02-21T10:58:13.5869482Z max=1.2565 2026-02-21T10:58:13.5869712Z best={'block_sizes': [1, 64, 32], 2026-02-21T10:58:13.5870119Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 
2026-02-21T10:58:13.5870516Z 'l2_groupings': [64], 2026-02-21T10:58:13.5870838Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:58:13.5871153Z 'loop_orders': [[0, 1]], 2026-02-21T10:58:13.5871454Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:58:13.5871716Z 'num_stages': 3, 2026-02-21T10:58:13.5871953Z 'num_warps': 4, 2026-02-21T10:58:13.5872181Z 'pid_type': 'flat', 2026-02-21T10:58:13.5872810Z 'range_flattens': [None, None], 2026-02-21T10:58:13.5873111Z 'range_multi_buffers': [None, True], 2026-02-21T10:58:13.5873399Z 'range_num_stages': [0, 3], 2026-02-21T10:58:13.5873556Z 'range_unroll_factors': [0, 4], 2026-02-21T10:58:13.5873685Z 'range_warp_specializes': [], 2026-02-21T10:58:13.5873812Z 'waves_per_eu': 3} 2026-02-21T10:58:13.6486489Z [311s] Fitting surrogate: 779 points, 779 targets 2026-02-21T10:58:14.9119835Z [312s] Generation 10 starting: 44 neighbors, 3 active search path(s) 2026-02-21T10:58:26.0065098Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 1.2 configs/s 2026-02-21T10:58:28.8610866Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 44/44 16.2 configs/s 2026-02-21T10:58:35.2023198Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 148.6 2026-02-21T10:58:35.2023564Z configs/s 2026-02-21T10:58:36.0354576Z [334s] Generation 10 complete: 2026-02-21T10:58:36.0354949Z ok=47 2026-02-21T10:58:36.0355166Z min=0.1344 2026-02-21T10:58:36.0355386Z mid=0.1558 2026-02-21T10:58:36.0355590Z max=1.8760 2026-02-21T10:58:36.0355826Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:58:36.0356233Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:58:36.0356653Z 'l2_groupings': [64], 2026-02-21T10:58:36.0356934Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:58:36.0357259Z 'loop_orders': [[0, 1]], 2026-02-21T10:58:36.0357542Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:58:36.0357810Z 'num_stages': 3, 2026-02-21T10:58:36.0358045Z 'num_warps': 2, 2026-02-21T10:58:36.0358279Z 'pid_type': 'flat', 
2026-02-21T10:58:36.0358560Z 'range_flattens': [None, None], 2026-02-21T10:58:36.0358862Z 'range_multi_buffers': [None, True], 2026-02-21T10:58:36.0359186Z 'range_num_stages': [0, 3], 2026-02-21T10:58:36.0359464Z 'range_unroll_factors': [0, 4], 2026-02-21T10:58:36.0359766Z 'range_warp_specializes': [], 2026-02-21T10:58:36.0360047Z 'waves_per_eu': 2} 2026-02-21T10:58:36.0960214Z [334s] Fitting surrogate: 826 points, 826 targets 2026-02-21T10:58:36.6420562Z [334s] Generation 11 starting: 44 neighbors, 3 active search path(s) 2026-02-21T10:58:44.7437966Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 4.7 configs/s 2026-02-21T10:58:47.5023516Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 44/44 16.8 configs/s 2026-02-21T10:58:52.1580901Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 226.4 2026-02-21T10:58:52.1581436Z configs/s 2026-02-21T10:58:52.9638082Z [350s] Generation 11 complete: 2026-02-21T10:58:52.9638544Z ok=47 2026-02-21T10:58:52.9638776Z min=0.1328 2026-02-21T10:58:52.9638991Z mid=0.2040 2026-02-21T10:58:52.9639651Z max=0.6957 2026-02-21T10:58:52.9639874Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:58:52.9640289Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:58:52.9640705Z 'l2_groupings': [64], 2026-02-21T10:58:52.9640993Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:58:52.9641319Z 'loop_orders': [[0, 1]], 2026-02-21T10:58:52.9641598Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:58:52.9641869Z 'num_stages': 3, 2026-02-21T10:58:52.9642115Z 'num_warps': 2, 2026-02-21T10:58:52.9642349Z 'pid_type': 'flat', 2026-02-21T10:58:52.9642714Z 'range_flattens': [None, None], 2026-02-21T10:58:52.9643028Z 'range_multi_buffers': [None, True], 2026-02-21T10:58:52.9643336Z 'range_num_stages': [0, 4], 2026-02-21T10:58:52.9643622Z 'range_unroll_factors': [0, 4], 2026-02-21T10:58:52.9643920Z 'range_warp_specializes': [], 2026-02-21T10:58:52.9644218Z 'waves_per_eu': 2} 
2026-02-21T10:58:53.0055994Z [351s] Fitting surrogate: 873 points, 873 targets 2026-02-21T10:58:53.4319505Z [351s] Generation 12 starting: 27 neighbors, 2 active search path(s) 2026-02-21T10:59:01.4739846Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 2.3 configs/s 2026-02-21T10:59:03.2155784Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 27/27 16.9 configs/s 2026-02-21T10:59:05.5341044Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 371.5 2026-02-21T10:59:05.5341559Z configs/s 2026-02-21T10:59:06.3742973Z [364s] Generation 12 complete: 2026-02-21T10:59:06.3743392Z ok=29 2026-02-21T10:59:06.3743606Z min=0.1352 2026-02-21T10:59:06.3743825Z mid=0.1922 2026-02-21T10:59:06.3744028Z max=1.5350 2026-02-21T10:59:06.3744280Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:59:06.3744689Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:59:06.3745795Z 'l2_groupings': [64], 2026-02-21T10:59:06.3746081Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:59:06.3746429Z 'loop_orders': [[0, 1]], 2026-02-21T10:59:06.3746714Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:59:06.3746979Z 'num_stages': 3, 2026-02-21T10:59:06.3747227Z 'num_warps': 2, 2026-02-21T10:59:06.3747407Z 'pid_type': 'flat', 2026-02-21T10:59:06.3747602Z 'range_flattens': [None, None], 2026-02-21T10:59:06.3747819Z 'range_multi_buffers': [None, True], 2026-02-21T10:59:06.3748042Z 'range_num_stages': [0, 4], 2026-02-21T10:59:06.3748238Z 'range_unroll_factors': [0, 4], 2026-02-21T10:59:06.3748458Z 'range_warp_specializes': [], 2026-02-21T10:59:06.3748662Z 'waves_per_eu': 2} 2026-02-21T10:59:06.4011677Z [364s] Fitting surrogate: 902 points, 902 targets 2026-02-21T10:59:06.6481895Z [364s] Generation 13 starting: 16 neighbors, 1 active search path(s) 2026-02-21T10:59:10.8132902Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.6 configs/s 2026-02-21T10:59:11.8910801Z Generation 13: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━ 16/16 17.2 configs/s 2026-02-21T10:59:13.3878064Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 544.4 2026-02-21T10:59:13.3878665Z configs/s 2026-02-21T10:59:14.2338011Z [372s] Generation 13 complete: 2026-02-21T10:59:14.2338360Z ok=18 2026-02-21T10:59:14.2338574Z min=0.1360 2026-02-21T10:59:14.2338798Z mid=0.1694 2026-02-21T10:59:14.2339015Z max=0.7277 2026-02-21T10:59:14.2339244Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:59:14.2340002Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:59:14.2340404Z 'l2_groupings': [64], 2026-02-21T10:59:14.2340692Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:59:14.2341010Z 'loop_orders': [[0, 1]], 2026-02-21T10:59:14.2341298Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:59:14.2341566Z 'num_stages': 3, 2026-02-21T10:59:14.2341808Z 'num_warps': 2, 2026-02-21T10:59:14.2342055Z 'pid_type': 'flat', 2026-02-21T10:59:14.2342325Z 'range_flattens': [None, None], 2026-02-21T10:59:14.2342797Z 'range_multi_buffers': [None, True], 2026-02-21T10:59:14.2343107Z 'range_num_stages': [0, 4], 2026-02-21T10:59:14.2343389Z 'range_unroll_factors': [0, 4], 2026-02-21T10:59:14.2343687Z 'range_warp_specializes': [], 2026-02-21T10:59:14.2343979Z 'waves_per_eu': 2} 2026-02-21T10:59:14.2506849Z [372s] Fitting surrogate: 920 points, 920 targets 2026-02-21T10:59:14.4831704Z [372s] Generation 14 starting: 10 neighbors, 1 active search path(s) 2026-02-21T10:59:17.6516203Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 4.2 configs/s 2026-02-21T10:59:18.3477899Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.2 configs/s 2026-02-21T10:59:18.8194826Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1486.0 2026-02-21T10:59:18.8195368Z configs/s 2026-02-21T10:59:19.5924968Z [377s] Generation 14 complete: 2026-02-21T10:59:19.5925231Z ok=12 2026-02-21T10:59:19.5925360Z min=0.1354 2026-02-21T10:59:19.5925505Z mid=0.2067 
2026-02-21T10:59:19.5925628Z max=0.7286 2026-02-21T10:59:19.5925766Z best={'block_sizes': [1, 32, 64], 2026-02-21T10:59:19.5926286Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T10:59:19.5926522Z 'l2_groupings': [64], 2026-02-21T10:59:19.5926694Z 'load_eviction_policies': ['', '', ''], 2026-02-21T10:59:19.5926886Z 'loop_orders': [[0, 1]], 2026-02-21T10:59:19.5927060Z 'matrix_instr_nonkdim': 16, 2026-02-21T10:59:19.5927229Z 'num_stages': 3, 2026-02-21T10:59:19.5927376Z 'num_warps': 2, 2026-02-21T10:59:19.5927519Z 'pid_type': 'flat', 2026-02-21T10:59:19.5927675Z 'range_flattens': [None, None], 2026-02-21T10:59:19.5927861Z 'range_multi_buffers': [None, True], 2026-02-21T10:59:19.5928046Z 'range_num_stages': [0, 4], 2026-02-21T10:59:19.5928214Z 'range_unroll_factors': [0, 4], 2026-02-21T10:59:19.5928542Z 'range_warp_specializes': [], 2026-02-21T10:59:19.5928714Z 'waves_per_eu': 2} 2026-02-21T10:59:19.6025899Z [377s] Fitting surrogate: 932 points, 932 targets 2026-02-21T10:59:19.7303787Z [377s] Autotuning complete in 377.7s after searching 883 configs. 2026-02-21T10:59:19.7303995Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:59:19.7304694Z @helion.kernel(config=helion.Config(block_sizes=[1, 32, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T10:59:19.7305320Z 2026-02-21T10:59:19.7305484Z [377s] Code of selected kernel: /tmp/torchinductor_root/dg/cdgqamsh5f3665xiepsy3mv7a5juf25esbrhspkp7ydkdbuw4q2z.py 2026-02-21T10:59:20.6969789Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T10:59:20.6970662Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T10:59:20.6971855Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T10:59:20.6972029Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T10:59:20.6972185Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T10:59:20.6972310Z ------------------------------------------ 2026-02-21T10:59:20.6972430Z (4, 48, 512, 512, 128) 2026-02-21T10:59:20.6972503Z 2026-02-21T10:59:20.6976939Z 50%|█████ | 3/6 [14:54<15:51, 317.23s/it]WARNING:tritonbench.utils.triton_op:Running input ID 4: 2026-02-21T10:59:20.6977245Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T10:59:20.6977366Z ------------------------------------------ 2026-02-21T10:59:20.6977487Z (4, 48, 2048, 2048, 128) 2026-02-21T10:59:20.6978778Z INFO:tritonbench.utils.triton_op:Took 0.06ms to get benchmark function for aten 2026-02-21T10:59:21.6966569Z INFO:tritonbench.utils.triton_op:Took 1.44ms to get benchmark function for flex_attention 2026-02-21T10:59:23.2232115Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:59:23.2232389Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:59:23.2232528Z 'dtype': 'torch.bfloat16', 2026-02-21T10:59:23.2232661Z 'shape': (4, 48, 2048, 128), 2026-02-21T10:59:23.2232801Z 'stride': (12582912, 262144, 128, 1)}, 2026-02-21T10:59:23.2232935Z { 'device': 'cuda:0', 2026-02-21T10:59:23.2233060Z 'dtype': 'torch.bfloat16', 2026-02-21T10:59:23.2233182Z 'shape': (4, 48, 2048, 128), 
2026-02-21T10:59:23.2233321Z 'stride': (12582912, 262144, 128, 1)}, 2026-02-21T10:59:23.2233449Z { 'device': 'cuda:0', 2026-02-21T10:59:23.2233567Z 'dtype': 'torch.bfloat16', 2026-02-21T10:59:23.2233693Z 'shape': (4, 48, 2048, 128), 2026-02-21T10:59:23.2233825Z 'stride': (12582912, 262144, 128, 1)}), 2026-02-21T10:59:23.2233952Z 'kwargs': {}} 2026-02-21T10:59:23.2268584Z INFO:tritonbench.utils.triton_op:Took 4.12ms to get benchmark function for helion_attention 2026-02-21T10:59:23.4694352Z [0s] Autotune random seed: 2144140282 2026-02-21T10:59:23.6291759Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T10:59:57.0940209Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T10:59:57.6167359Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 1, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:00:02.0213910Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, False], 
range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:00:03.2498343Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:00:03.6512042Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:00:03.8180487Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:00:05.0804633Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], 
range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:00:05.3701283Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:00:05.5019365Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:00:05.6578850Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:00:06.4707463Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[2, 4], 
range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:00:07.6472346Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:00:08.2504612Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 4, 2048], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:00:08.7194187Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 2, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:00:08.9374382Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[4, 0], 
range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:00:09.0858573Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:00:09.2385131Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:00:09.4728696Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 4, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[4, 1], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:00:10.1980685Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], 
range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:00:10.1998707Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.2 configs/s 2026-02-21T11:00:14.5738300Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:57:24: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T11:00:14.5740401Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 2048], [262144, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T11:00:14.5741268Z ^ 2026-02-21T11:00:14.5742998Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:59:145: note: - use: %144 = "tt.reshape"(<>) : (tensor<1x128x1024xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x1024xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>> 2026-02-21T11:00:14.5744752Z 2026-02-21T11:00:14.5745696Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T11:00:14.5747072Z ^ 2026-02-21T11:00:14.5747577Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T11:00:14.5748255Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}> 2026-02-21T11:00:14.5749135Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}> 2026-02-21T11:00:14.5749727Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}> 2026-02-21T11:00:14.5750265Z #blocked3 = #ttg.blocked<{sizePerThread = 
[1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}> 2026-02-21T11:00:14.5750794Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:00:14.5751333Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}> 2026-02-21T11:00:14.5751855Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T11:00:14.5752352Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T11:00:14.5752840Z #blocked8 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}> 2026-02-21T11:00:14.5753318Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}> 2026-02-21T11:00:14.5753960Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T11:00:14.5754494Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}> 2026-02-21T11:00:14.5755042Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}> 2026-02-21T11:00:14.5755579Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}> 2026-02-21T11:00:14.5756116Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:00:14.5756646Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}> 2026-02-21T11:00:14.5757172Z #blocked16 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], 
warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T11:00:14.5757689Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}> 2026-02-21T11:00:14.5758304Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T11:00:14.5759192Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T11:00:14.5759805Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T11:00:14.5759969Z %c192_i64 = arith.constant 192 : i64 2026-02-21T11:00:14.5760125Z %c0_i64 = arith.constant 0 : i64 2026-02-21T11:00:14.5760280Z %c262144_i64 = arith.constant 262144 : i64 2026-02-21T11:00:14.5760503Z %cst = arith.constant dense<0.000000e+00> : tensor<1x128x1024xbf16, #blocked> 2026-02-21T11:00:14.5760766Z %cst_0 = arith.constant dense<2048> : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:00:14.5761000Z %cst_1 = arith.constant dense<0> : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:00:14.5761238Z %cst_2 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked1> 2026-02-21T11:00:14.5761489Z %cst_3 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked1> 2026-02-21T11:00:14.5761713Z %cst_4 = arith.constant dense<128> : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:00:14.5761958Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked2> 2026-02-21T11:00:14.5762209Z %cst_6 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:00:14.5762430Z %cst_7 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:00:14.5762741Z %cst_8 = arith.constant dense<2048> : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:00:14.5762969Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked4> 
2026-02-21T11:00:14.5763195Z %cst_10 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:00:14.5763398Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T11:00:14.5763550Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T11:00:14.5763704Z %c304_i32 = arith.constant 304 : i32 2026-02-21T11:00:14.5763889Z %cst_11 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked4> 2026-02-21T11:00:14.5764118Z %cst_12 = arith.constant dense<128> : tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:00:14.5764367Z %cst_13 = arith.constant dense<0.127517432> : tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5764626Z %cst_14 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5764881Z %cst_15 = arith.constant dense<0.000000e+00> : tensor<2x1024xf32, #blocked7> 2026-02-21T11:00:14.5765086Z %c0_i32 = arith.constant 0 : i32 2026-02-21T11:00:14.5765289Z %cst_16 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked2> 2026-02-21T11:00:14.5765573Z %cst_17 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5765849Z %cst_18 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5766048Z %c2_i32 = arith.constant 2 : i32 2026-02-21T11:00:14.5766192Z %c192_i32 = arith.constant 192 : i32 2026-02-21T11:00:14.5766347Z %c196608_i32 = arith.constant 196608 : i32 2026-02-21T11:00:14.5766505Z %0 = tt.get_program_id x : i32 2026-02-21T11:00:14.5766698Z %1 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked8> 2026-02-21T11:00:14.5766962Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8> 2026-02-21T11:00:14.5767233Z %3 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:00:14.5767492Z %4 = arith.extsi %1 : tensor<2xi32, #blocked8> to tensor<2xi64, #blocked8> 2026-02-21T11:00:14.5767750Z %5 = arith.extsi %2 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8> 2026-02-21T11:00:14.5768086Z 
%6 = ttg.convert_layout %5 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:00:14.5768531Z %7 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9> 2026-02-21T11:00:14.5768902Z %8 = ttg.convert_layout %7 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked10> 2026-02-21T11:00:14.5769304Z %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T11:00:14.5769749Z %10 = tt.expand_dims %9 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11> 2026-02-21T11:00:14.5770062Z %11 = ttg.convert_layout %10 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked3> 2026-02-21T11:00:14.5770318Z %12 = tt.broadcast %11 : tensor<1x1x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked3> 2026-02-21T11:00:14.5770566Z %13 = ttg.convert_layout %12 : tensor<1x2x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked2> 2026-02-21T11:00:14.5770791Z %14 = arith.cmpi sge, %11, %cst_7 : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:00:14.5770968Z %15 = arith.cmpi slt, %11, %cst_6 : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:00:14.5771157Z %16 = arith.andi %14, %15 : tensor<1x1x128xi1, #blocked3> 2026-02-21T11:00:14.5771353Z %17 = tt.broadcast %16 : tensor<1x1x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked3> 2026-02-21T11:00:14.5771600Z %18 = ttg.convert_layout %17 : tensor<1x2x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked2> 2026-02-21T11:00:14.5771839Z %19 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked8> 2026-02-21T11:00:14.5772054Z %20 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x1024x!tt.ptr, #blocked> 2026-02-21T11:00:14.5772338Z %21 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> 
2026-02-21T11:00:14.5772687Z %22 = tt.expand_dims %21 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi64, #blocked12> 2026-02-21T11:00:14.5773004Z %23 = ttg.convert_layout %22 : tensor<1x128x1xi64, #blocked12> -> tensor<1x128x1xi64, #blocked1> 2026-02-21T11:00:14.5773263Z %24 = tt.broadcast %23 : tensor<1x128x1xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked1> 2026-02-21T11:00:14.5781699Z %25 = ttg.convert_layout %24 : tensor<1x128x1024xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked> 2026-02-21T11:00:14.5781971Z %26 = arith.extsi %19 : tensor<1024xi32, #blocked8> to tensor<1024xi64, #blocked8> 2026-02-21T11:00:14.5782171Z %27 = arith.cmpi sge, %23, %cst_3 : tensor<1x128x1xi64, #blocked1> 2026-02-21T11:00:14.5782347Z %28 = arith.cmpi slt, %23, %cst_2 : tensor<1x128x1xi64, #blocked1> 2026-02-21T11:00:14.5782515Z %29 = arith.andi %27, %28 : tensor<1x128x1xi1, #blocked1> 2026-02-21T11:00:14.5782798Z %30 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:00:14.5783126Z %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9> 2026-02-21T11:00:14.5783416Z %32 = ttg.convert_layout %31 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10> 2026-02-21T11:00:14.5783708Z %33 = ttg.convert_layout %32 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T11:00:14.5784050Z %34 = tt.expand_dims %33 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11> 2026-02-21T11:00:14.5784348Z %35 = ttg.convert_layout %34 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3> 2026-02-21T11:00:14.5784601Z %36 = tt.broadcast %35 : tensor<1x1x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked3> 2026-02-21T11:00:14.5784859Z %37 = ttg.convert_layout 
%36 : tensor<1x1024x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:00:14.5785121Z %38 = tt.splat %arg2 : !tt.ptr -> tensor<1x1024x128x!tt.ptr, #blocked13> 2026-02-21T11:00:14.5785388Z %39 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:00:14.5785708Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9> 2026-02-21T11:00:14.5786011Z %41 = ttg.convert_layout %40 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10> 2026-02-21T11:00:14.5786296Z %42 = ttg.convert_layout %41 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T11:00:14.5786637Z %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11> 2026-02-21T11:00:14.5786936Z %44 = ttg.convert_layout %43 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3> 2026-02-21T11:00:14.5787178Z %45 = tt.broadcast %44 : tensor<1x1x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked3> 2026-02-21T11:00:14.5787449Z %46 = ttg.convert_layout %45 : tensor<1x2x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked2> 2026-02-21T11:00:14.5787679Z %47 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:00:14.5787865Z scf.for %arg4 = %0 to %c196608_i32 step %c304_i32 : i32 { 2026-02-21T11:00:14.5788014Z %48 = arith.remsi %arg4, %c192_i32 : i32 2026-02-21T11:00:14.5788138Z %49 = arith.divsi %arg4, %c192_i32 : i32 2026-02-21T11:00:14.5788261Z %50 = arith.muli %49, %c2_i32 : i32 2026-02-21T11:00:14.5788395Z %51 = tt.splat %50 : i32 -> tensor<2xi32, #blocked8> 2026-02-21T11:00:14.5788544Z %52 = arith.addi %51, %1 : tensor<2xi32, #blocked8> 2026-02-21T11:00:14.5788678Z %53 = arith.extsi %48 : i32 to i64 2026-02-21T11:00:14.5788793Z %54 = arith.extsi %50 : i32 to i64 
2026-02-21T11:00:14.5788914Z %55 = arith.muli %53, %c262144_i64 : i64 2026-02-21T11:00:14.5789053Z %56 = tt.splat %55 : i64 -> tensor<1x2x128xi64, #blocked2> 2026-02-21T11:00:14.5789205Z %57 = tt.splat %54 : i64 -> tensor<2xi64, #blocked8> 2026-02-21T11:00:14.5789348Z %58 = arith.addi %57, %4 : tensor<2xi64, #blocked8> 2026-02-21T11:00:14.5789572Z %59 = ttg.convert_layout %58 : tensor<2xi64, #blocked8> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:00:14.5789890Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x2xi64, #blocked9> 2026-02-21T11:00:14.5790168Z %61 = ttg.convert_layout %60 : tensor<1x2xi64, #blocked9> -> tensor<1x2xi64, #blocked6> 2026-02-21T11:00:14.5790466Z %62 = ttg.convert_layout %61 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:00:14.5790795Z %63 = tt.expand_dims %62 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi64, #blocked14> 2026-02-21T11:00:14.5791089Z %64 = ttg.convert_layout %63 : tensor<1x2x1xi64, #blocked14> -> tensor<1x2x1xi64, #blocked4> 2026-02-21T11:00:14.5791298Z %65 = arith.muli %64, %cst_10 : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:00:14.5791490Z %66 = tt.broadcast %65 : tensor<1x2x1xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4> 2026-02-21T11:00:14.5791730Z %67 = ttg.convert_layout %66 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked2> 2026-02-21T11:00:14.5791936Z %68 = arith.addi %67, %13 : tensor<1x2x128xi64, #blocked2> 2026-02-21T11:00:14.5792093Z %69 = arith.addi %56, %68 : tensor<1x2x128xi64, #blocked2> 2026-02-21T11:00:14.5792296Z %70 = tt.addptr %3, %69 : tensor<1x2x128x!tt.ptr, #blocked2>, tensor<1x2x128xi64, #blocked2> 2026-02-21T11:00:14.5792487Z %71 = arith.cmpi sge, %53, %c0_i64 : i64 2026-02-21T11:00:14.5792616Z %72 = arith.cmpi slt, %53, %c192_i64 : i64 2026-02-21T11:00:14.5792759Z %73 = arith.andi 
%71, %72 : i1 2026-02-21T11:00:14.5792903Z %74 = arith.cmpi sge, %64, %cst_9 : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:00:14.5793073Z %75 = arith.cmpi slt, %64, %cst_8 : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:00:14.5793237Z %76 = arith.andi %74, %75 : tensor<1x2x1xi1, #blocked4> 2026-02-21T11:00:14.5793394Z %77 = tt.splat %73 : i1 -> tensor<1x2x1xi1, #blocked4> 2026-02-21T11:00:14.5793556Z %78 = arith.andi %77, %76 : tensor<1x2x1xi1, #blocked4> 2026-02-21T11:00:14.5793747Z %79 = tt.broadcast %78 : tensor<1x2x1xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4> 2026-02-21T11:00:14.5793986Z %80 = ttg.convert_layout %79 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked2> 2026-02-21T11:00:14.5794194Z %81 = arith.andi %80, %18 : tensor<1x2x128xi1, #blocked2> 2026-02-21T11:00:14.5794366Z %82 = tt.load %70, %81, %cst_5 : tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:00:14.5794544Z %83 = tt.splat %55 : i64 -> tensor<1x128x1024xi64, #blocked> 2026-02-21T11:00:14.5794703Z %84 = tt.splat %73 : i1 -> tensor<1x128x1xi1, #blocked1> 2026-02-21T11:00:14.5794869Z %85 = arith.andi %84, %29 : tensor<1x128x1xi1, #blocked1> 2026-02-21T11:00:14.5795065Z %86 = tt.broadcast %85 : tensor<1x128x1xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked1> 2026-02-21T11:00:14.5795313Z %87 = ttg.convert_layout %86 : tensor<1x128x1024xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked> 2026-02-21T11:00:14.5795561Z %88 = tt.reshape %82 : tensor<1x2x128xbf16, #blocked2> -> tensor<2x128xbf16, #blocked10> 2026-02-21T11:00:14.5795743Z %89 = arith.muli %48, %c262144_i32 : i32 2026-02-21T11:00:14.5795881Z %90 = tt.splat %89 : i32 -> tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:00:14.5796247Z %91:3 = scf.for %arg5 = %c0_i32 to %c2048_i32 step %c1024_i32 iter_args(%arg6 = %cst_18, %arg7 = %cst_17, %arg8 = %cst_16) -> (tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2>) : i32 { 2026-02-21T11:00:14.5796605Z %113 = tt.splat %arg5 : i32 -> tensor<1024xi32, 
#blocked8> 2026-02-21T11:00:14.5796768Z %114 = arith.addi %113, %19 : tensor<1024xi32, #blocked8> 2026-02-21T11:00:14.5796911Z %115 = arith.extsi %arg5 : i32 to i64 2026-02-21T11:00:14.5797048Z %116 = tt.splat %115 : i64 -> tensor<1024xi64, #blocked8> 2026-02-21T11:00:14.5797204Z %117 = arith.addi %116, %26 : tensor<1024xi64, #blocked8> 2026-02-21T11:00:14.5797449Z %118 = ttg.convert_layout %117 : tensor<1024xi64, #blocked8> -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:00:14.5797792Z %119 = tt.expand_dims %118 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi64, #blocked9> 2026-02-21T11:00:14.5798116Z %120 = ttg.convert_layout %119 : tensor<1x1024xi64, #blocked9> -> tensor<1x1024xi64, #blocked7> 2026-02-21T11:00:14.5798414Z %121 = ttg.convert_layout %120 : tensor<1x1024xi64, #blocked7> -> tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>> 2026-02-21T11:00:14.5798772Z %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>> -> tensor<1x1x1024xi64, #blocked15> 2026-02-21T11:00:14.5799089Z %123 = ttg.convert_layout %122 : tensor<1x1x1024xi64, #blocked15> -> tensor<1x1x1024xi64, #blocked> 2026-02-21T11:00:14.5799314Z %124 = arith.muli %123, %cst_4 : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:00:14.5799525Z %125 = tt.broadcast %124 : tensor<1x1x1024xi64, #blocked> -> tensor<1x128x1024xi64, #blocked> 2026-02-21T11:00:14.5799731Z %126 = arith.addi %25, %125 : tensor<1x128x1024xi64, #blocked> 2026-02-21T11:00:14.5799897Z %127 = arith.addi %83, %126 : tensor<1x128x1024xi64, #blocked> 2026-02-21T11:00:14.5800112Z %128 = tt.addptr %20, %127 : tensor<1x128x1024x!tt.ptr, #blocked>, tensor<1x128x1024xi64, #blocked> 2026-02-21T11:00:14.5800360Z %129 = arith.cmpi sge, %123, %cst_1 : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:00:14.5800543Z %130 = arith.cmpi slt, %123, %cst_0 : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:00:14.5800715Z 
%131 = arith.andi %129, %130 : tensor<1x1x1024xi1, #blocked> 2026-02-21T11:00:14.5800919Z %132 = tt.broadcast %131 : tensor<1x1x1024xi1, #blocked> -> tensor<1x128x1024xi1, #blocked> 2026-02-21T11:00:14.5801122Z %133 = arith.andi %87, %132 : tensor<1x128x1024xi1, #blocked> 2026-02-21T11:00:14.5801317Z %134 = tt.load %128, %133, %cst : tensor<1x128x1024x!tt.ptr, #blocked> 2026-02-21T11:00:14.5801539Z %135 = tt.reshape %134 : tensor<1x128x1024xbf16, #blocked> -> tensor<128x1024xbf16, #blocked7> 2026-02-21T11:00:14.5801847Z %136 = ttg.convert_layout %88 : tensor<2x128xbf16, #blocked10> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> 2026-02-21T11:00:14.5802215Z %137 = ttg.convert_layout %135 : tensor<128x1024xbf16, #blocked7> -> tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> 2026-02-21T11:00:14.5802549Z %138 = ttg.convert_layout %cst_15 : tensor<2x1024xf32, #blocked7> -> tensor<2x1024xf32, #blocked16> 2026-02-21T11:00:14.5803027Z %139 = tt.dot %136, %137, %138, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> * tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> -> tensor<2x1024xf32, #blocked16> 2026-02-21T11:00:14.5803445Z %140 = ttg.convert_layout %139 : tensor<2x1024xf32, #blocked16> -> tensor<2x1024xf32, #blocked7> 2026-02-21T11:00:14.5803688Z %141 = tt.reshape %140 : tensor<2x1024xf32, #blocked7> -> tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5803930Z %142 = arith.truncf %141 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked> 2026-02-21T11:00:14.5804172Z %143 = arith.extf %142 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5804367Z %144 = "tt.reduce"(%143) <{axis = 2 : i32}> ({ 2026-02-21T11:00:14.5804497Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T11:00:14.5804619Z %199 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T11:00:14.5804748Z tt.reduce.return %199 : f32 2026-02-21T11:00:14.5804936Z }) : 
(tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:00:14.5805227Z %145 = ttg.convert_layout %144 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5805497Z %146 = arith.truncf %145 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6> 2026-02-21T11:00:14.5805721Z %147 = arith.extf %146 : tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5806086Z %148 = arith.mulf %147, %cst_14 : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5806275Z %149 = arith.truncf %148 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6> 2026-02-21T11:00:14.5806495Z %150 = arith.extf %149 : tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5806689Z %151 = arith.cmpf ogt, %arg6, %150 : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5806863Z %152 = arith.cmpf une, %arg6, %arg6 : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5807030Z %153 = arith.ori %151, %152 : tensor<1x2xi1, #blocked6> 2026-02-21T11:00:14.5807222Z %154 = arith.select %153, %arg6, %150 : tensor<1x2xi1, #blocked6>, tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5807429Z %155 = arith.mulf %143, %cst_13 : tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5807634Z %156 = arith.truncf %155 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked> 2026-02-21T11:00:14.5807925Z %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:00:14.5808267Z %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14> 2026-02-21T11:00:14.5808586Z %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T11:00:14.5808831Z %160 = arith.extf %156 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5809068Z %161 = 
tt.broadcast %159 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked4> 2026-02-21T11:00:14.5809343Z %162 = ttg.convert_layout %161 : tensor<1x2x1024xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5809560Z %163 = arith.subf %160, %162 : tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5809865Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2x1024xf32, #blocked> 2026-02-21T11:00:14.5810155Z %165 = "tt.reduce"(%164) <{axis = 2 : i32}> ({ 2026-02-21T11:00:14.5810285Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T11:00:14.5810406Z %199 = arith.addf %arg9, %arg10 : f32 2026-02-21T11:00:14.5810529Z tt.reduce.return %199 : f32 2026-02-21T11:00:14.5810735Z }) : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:00:14.5811026Z %166 = ttg.convert_layout %165 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5811266Z %167 = arith.subf %arg6, %154 : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5811554Z %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked6>) -> tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5811842Z %169 = arith.mulf %arg7, %168 : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5811998Z %170 = arith.addf %169, %166 : tensor<1x2xf32, #blocked6> 2026-02-21T11:00:14.5812243Z %171 = ttg.convert_layout %168 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:00:14.5812578Z %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14> 2026-02-21T11:00:14.5812880Z %173 = ttg.convert_layout %172 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T11:00:14.5813126Z %174 = tt.broadcast %173 : 
tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4> 2026-02-21T11:00:14.5813375Z %175 = ttg.convert_layout %174 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2> 2026-02-21T11:00:14.5813590Z %176 = arith.mulf %arg8, %175 : tensor<1x2x128xf32, #blocked2> 2026-02-21T11:00:14.5813850Z %177 = ttg.convert_layout %114 : tensor<1024xi32, #blocked8> -> tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:00:14.5814189Z %178 = tt.expand_dims %177 {axis = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi32, #blocked9> 2026-02-21T11:00:14.5814492Z %179 = ttg.convert_layout %178 : tensor<1x1024xi32, #blocked9> -> tensor<1x1024xi32, #blocked7> 2026-02-21T11:00:14.5814787Z %180 = ttg.convert_layout %179 : tensor<1x1024xi32, #blocked7> -> tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> 2026-02-21T11:00:14.5815139Z %181 = tt.expand_dims %180 {axis = 2 : i32} : tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x1024x1xi32, #blocked17> 2026-02-21T11:00:14.5815455Z %182 = ttg.convert_layout %181 : tensor<1x1024x1xi32, #blocked17> -> tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:00:14.5815681Z %183 = arith.muli %182, %cst_12 : tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:00:14.5815852Z %184 = arith.addi %90, %183 : tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:00:14.5816057Z %185 = tt.broadcast %184 : tensor<1x1024x1xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked5> 2026-02-21T11:00:14.5816345Z %186 = ttg.convert_layout %185 : tensor<1x1024x128xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:00:14.5816569Z %187 = arith.addi %186, %37 : tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:00:14.5816800Z %188 = tt.addptr %38, %187 : tensor<1x1024x128x!tt.ptr, #blocked13>, tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:00:14.5817032Z %189 = tt.load %188 : tensor<1x1024x128x!tt.ptr, #blocked13> 2026-02-21T11:00:14.5817264Z %190 = 
arith.truncf %164 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked> 2026-02-21T11:00:14.5817510Z %191 = tt.reshape %176 : tensor<1x2x128xf32, #blocked2> -> tensor<2x128xf32, #blocked10> 2026-02-21T11:00:14.5817745Z %192 = tt.reshape %190 : tensor<1x2x1024xbf16, #blocked> -> tensor<2x1024xbf16, #blocked7> 2026-02-21T11:00:14.5817995Z %193 = tt.reshape %189 : tensor<1x1024x128xbf16, #blocked13> -> tensor<1024x128xbf16, #blocked10> 2026-02-21T11:00:14.5818309Z %194 = ttg.convert_layout %192 : tensor<2x1024xbf16, #blocked7> -> tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> 2026-02-21T11:00:14.5818691Z %195 = ttg.convert_layout %193 : tensor<1024x128xbf16, #blocked10> -> tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> 2026-02-21T11:00:14.5819002Z %196 = ttg.convert_layout %191 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked10> 2026-02-21T11:00:14.5819416Z %197 = tt.dot %194, %195, %196, inputPrecision = tf32 : tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10> 2026-02-21T11:00:14.5819822Z %198 = tt.reshape %197 : tensor<2x128xf32, #blocked10> -> tensor<1x2x128xf32, #blocked2> 2026-02-21T11:00:14.5820093Z scf.yield %154, %170, %198 : tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2> 2026-02-21T11:00:14.5820300Z } 2026-02-21T11:00:14.5820490Z %92 = ttg.convert_layout %91#1 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:00:14.5820826Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14> 2026-02-21T11:00:14.5821121Z %94 = ttg.convert_layout %93 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T11:00:14.5821360Z %95 = tt.broadcast %94 : tensor<1x2x1xf32, #blocked4> -> 
tensor<1x2x128xf32, #blocked4> 2026-02-21T11:00:14.5821598Z %96 = ttg.convert_layout %95 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2> 2026-02-21T11:00:14.5821809Z %97 = arith.divf %91#2, %96 : tensor<1x2x128xf32, #blocked2> 2026-02-21T11:00:14.5822025Z %98 = arith.truncf %97 : tensor<1x2x128xf32, #blocked2> to tensor<1x2x128xbf16, #blocked2> 2026-02-21T11:00:14.5822214Z %99 = arith.muli %48, %c262144_i32 : i32 2026-02-21T11:00:14.5822432Z %100 = ttg.convert_layout %52 : tensor<2xi32, #blocked8> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:00:14.5822751Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x2xi32, #blocked9> 2026-02-21T11:00:14.5823039Z %102 = ttg.convert_layout %101 : tensor<1x2xi32, #blocked9> -> tensor<1x2xi32, #blocked6> 2026-02-21T11:00:14.5823321Z %103 = ttg.convert_layout %102 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:00:14.5823658Z %104 = tt.expand_dims %103 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi32, #blocked14> 2026-02-21T11:00:14.5823959Z %105 = ttg.convert_layout %104 : tensor<1x2x1xi32, #blocked14> -> tensor<1x2x1xi32, #blocked4> 2026-02-21T11:00:14.5824168Z %106 = arith.muli %105, %cst_11 : tensor<1x2x1xi32, #blocked4> 2026-02-21T11:00:14.5824351Z %107 = tt.splat %99 : i32 -> tensor<1x2x1xi32, #blocked4> 2026-02-21T11:00:14.5824511Z %108 = arith.addi %107, %106 : tensor<1x2x1xi32, #blocked4> 2026-02-21T11:00:14.5824706Z %109 = tt.broadcast %108 : tensor<1x2x1xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4> 2026-02-21T11:00:14.5824951Z %110 = ttg.convert_layout %109 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked2> 2026-02-21T11:00:14.5825174Z %111 = arith.addi %110, %46 : tensor<1x2x128xi32, #blocked2> 2026-02-21T11:00:14.5825386Z %112 = tt.addptr %47, %111 : tensor<1x2x128x!tt.ptr, 
#blocked2>, tensor<1x2x128xi32, #blocked2> 2026-02-21T11:00:14.5825605Z tt.store %112, %98 : tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:00:14.5825750Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T11:00:14.5825863Z tt.return 2026-02-21T11:00:14.5825944Z } 2026-02-21T11:00:14.5826021Z } 2026-02-21T11:00:14.5826068Z 2026-02-21T11:00:14.5826098Z {-# 2026-02-21T11:00:14.5826182Z external_resources: { 2026-02-21T11:00:14.5826281Z mlir_reproducer: { 2026-02-21T11:00:14.5828521Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T11:00:14.5830804Z disable_threading: false, 2026-02-21T11:00:14.5830909Z verify_each: true 2026-02-21T11:00:14.5831000Z } 2026-02-21T11:00:14.5831070Z } 
2026-02-21T11:00:14.5831140Z #-} 2026-02-21T11:00:14.5831431Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T11:00:14.5832135Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T11:00:14.5832681Z [50s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T11:00:14.5833477Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 0], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T11:00:14.5834203Z Error: RuntimeError: PassManager::run failed 2026-02-21T11:00:14.5834369Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T11:00:45.3004339Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:55:130: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T11:00:45.3005023Z k = tl.load(k_view + (indices_0[:, None, None] * 262144 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T11:00:45.3005415Z ^ 2026-02-21T11:00:45.3006846Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:57:141: note: - use: %132 = "tt.reshape"(<>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>> 2026-02-21T11:00:45.3007741Z 2026-02-21T11:00:45.3008317Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T11:00:45.3008990Z ^ 2026-02-21T11:00:45.3009252Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T11:00:45.3010609Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:00:45.3011090Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:00:45.3011550Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:00:45.3011988Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T11:00:45.3012424Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 
2026-02-21T11:00:45.3012832Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T11:00:45.3013244Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T11:00:45.3013699Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:00:45.3014159Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:00:45.3014734Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:00:45.3015186Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T11:00:45.3015624Z #blocked11 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T11:00:45.3016061Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T11:00:45.3016685Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T11:00:45.3017148Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T11:00:45.3017299Z %c192_i64 = arith.constant 192 : i64 2026-02-21T11:00:45.3017442Z %c0_i64 = arith.constant 0 : i64 2026-02-21T11:00:45.3017579Z %c262144_i64 = arith.constant 262144 : i64 2026-02-21T11:00:45.3017865Z %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked> 2026-02-21T11:00:45.3018092Z %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked> 2026-02-21T11:00:45.3018306Z %cst_1 = arith.constant 
dense<0> : tensor<1x1x128xi64, #blocked> 2026-02-21T11:00:45.3018509Z %cst_2 = arith.constant dense<2048> : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:00:45.3018720Z %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:00:45.3018946Z %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:00:45.3019120Z %c512_i32 = arith.constant 512 : i32 2026-02-21T11:00:45.3019258Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T11:00:45.3019394Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T11:00:45.3019558Z %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1> 2026-02-21T11:00:45.3019768Z %cst_6 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked2> 2026-02-21T11:00:45.3019983Z %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3020216Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3020462Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x512xf32, #blocked4> 2026-02-21T11:00:45.3020681Z %cst_10 = arith.constant dense<128> : tensor<1x1x512xi32, #blocked> 2026-02-21T11:00:45.3020851Z %c0_i32 = arith.constant 0 : i32 2026-02-21T11:00:45.3021032Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked> 2026-02-21T11:00:45.3021256Z %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3021485Z %cst_13 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3021662Z %c2_i32 = arith.constant 2 : i32 2026-02-21T11:00:45.3021795Z %c16_i32 = arith.constant 16 : i32 2026-02-21T11:00:45.3021926Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T11:00:45.3022062Z %0 = tt.get_program_id x : i32 2026-02-21T11:00:45.3022198Z %1 = arith.divsi %0, %c3072_i32 : i32 2026-02-21T11:00:45.3022327Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T11:00:45.3022460Z %3 = arith.subi %c1024_i32, %2 : i32 2026-02-21T11:00:45.3022587Z %4 = arith.minsi %3, 
%c16_i32 : i32 2026-02-21T11:00:45.3022722Z %5 = arith.remsi %0, %c3072_i32 : i32 2026-02-21T11:00:45.3022851Z %6 = arith.remsi %5, %4 : i32 2026-02-21T11:00:45.3022978Z %7 = arith.addi %2, %6 : i32 2026-02-21T11:00:45.3023103Z %8 = arith.divsi %5, %4 : i32 2026-02-21T11:00:45.3023229Z %9 = arith.muli %7, %c2_i32 : i32 2026-02-21T11:00:45.3023405Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5> 2026-02-21T11:00:45.3023611Z %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5> 2026-02-21T11:00:45.3023811Z %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5> 2026-02-21T11:00:45.3024013Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5> 2026-02-21T11:00:45.3024206Z %14 = arith.extsi %8 : i32 to i64 2026-02-21T11:00:45.3024334Z %15 = arith.extsi %9 : i32 to i64 2026-02-21T11:00:45.3024518Z %16 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T11:00:45.3024712Z %17 = arith.muli %14, %c262144_i64 : i64 2026-02-21T11:00:45.3024871Z %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked> 2026-02-21T11:00:45.3025043Z %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5> 2026-02-21T11:00:45.3025239Z %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5> 2026-02-21T11:00:45.3025438Z %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5> 2026-02-21T11:00:45.3025699Z %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:00:45.3026022Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6> 2026-02-21T11:00:45.3026319Z %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3> 2026-02-21T11:00:45.3026596Z %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:00:45.3026922Z %26 = tt.expand_dims %25 {axis = 2 : i32} : 
tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7> 2026-02-21T11:00:45.3027214Z %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1> 2026-02-21T11:00:45.3027433Z %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:00:45.3027627Z %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1> 2026-02-21T11:00:45.3027864Z %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked> 2026-02-21T11:00:45.3028095Z %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5> 2026-02-21T11:00:45.3028363Z %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:00:45.3028702Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6> 2026-02-21T11:00:45.3028986Z %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4> 2026-02-21T11:00:45.3029269Z %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:00:45.3029603Z %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8> 2026-02-21T11:00:45.3029900Z %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked> 2026-02-21T11:00:45.3030136Z %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked> 2026-02-21T11:00:45.3030332Z %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked> 2026-02-21T11:00:45.3030482Z %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked> 2026-02-21T11:00:45.3030679Z %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi64, #blocked> 2026-02-21T11:00:45.3030871Z %42 = arith.cmpi sge, %14, 
%c0_i64 : i64 2026-02-21T11:00:45.3030993Z %43 = arith.cmpi slt, %14, %c192_i64 : i64 2026-02-21T11:00:45.3031112Z %44 = arith.andi %42, %43 : i1 2026-02-21T11:00:45.3031250Z %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:00:45.3031423Z %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:00:45.3031585Z %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1> 2026-02-21T11:00:45.3031733Z %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1> 2026-02-21T11:00:45.3031900Z %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1> 2026-02-21T11:00:45.3032082Z %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1> 2026-02-21T11:00:45.3032319Z %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked> 2026-02-21T11:00:45.3032526Z %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked> 2026-02-21T11:00:45.3032696Z %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked> 2026-02-21T11:00:45.3032857Z %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked> 2026-02-21T11:00:45.3033039Z %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T11:00:45.3033226Z %56 = arith.andi %51, %55 : tensor<1x2x128xi1, #blocked> 2026-02-21T11:00:45.3033386Z %57 = tt.load %41, %56, %cst : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T11:00:45.3033579Z %58 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked5> 2026-02-21T11:00:45.3033745Z %59 = arith.muli %8, %c262144_i32 : i32 2026-02-21T11:00:45.3033963Z %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:00:45.3034309Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T11:00:45.3034595Z %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> 
tensor<1x128xi32, #blocked4> 2026-02-21T11:00:45.3034877Z %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T11:00:45.3035229Z %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9> 2026-02-21T11:00:45.3035525Z %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2> 2026-02-21T11:00:45.3035733Z %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2> 2026-02-21T11:00:45.3035888Z %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2> 2026-02-21T11:00:45.3036087Z %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x512xi32, #blocked2> 2026-02-21T11:00:45.3036352Z %69 = ttg.convert_layout %68 : tensor<1x128x512xi32, #blocked2> -> tensor<1x128x512xi32, #blocked> 2026-02-21T11:00:45.3036588Z %70 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x512x!tt.ptr, #blocked> 2026-02-21T11:00:45.3036806Z %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4> 2026-02-21T11:00:45.3036999Z %72 = tt.splat %59 : i32 -> tensor<1x512x1xi32, #blocked2> 2026-02-21T11:00:45.3037239Z %73 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:00:45.3037571Z %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8> 2026-02-21T11:00:45.3037871Z %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T11:00:45.3038115Z %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked> 2026-02-21T11:00:45.3038333Z %77 = tt.splat %arg2 : !tt.ptr -> tensor<1x512x128x!tt.ptr, #blocked> 2026-02-21T11:00:45.3038720Z %78:3 = scf.for %arg4 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, 
%arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>) : i32 { 2026-02-21T11:00:45.3039084Z %108 = tt.splat %arg4 : i32 -> tensor<512xi32, #blocked5> 2026-02-21T11:00:45.3039242Z %109 = arith.addi %108, %58 : tensor<512xi32, #blocked5> 2026-02-21T11:00:45.3039482Z %110 = ttg.convert_layout %109 : tensor<512xi32, #blocked5> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:00:45.3039850Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T11:00:45.3040148Z %112 = ttg.convert_layout %111 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked4> 2026-02-21T11:00:45.3040441Z %113 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:00:45.3040782Z %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x512xi32, #blocked8> 2026-02-21T11:00:45.3041088Z %115 = ttg.convert_layout %114 : tensor<1x1x512xi32, #blocked8> -> tensor<1x1x512xi32, #blocked> 2026-02-21T11:00:45.3041300Z %116 = arith.muli %115, %cst_10 : tensor<1x1x512xi32, #blocked> 2026-02-21T11:00:45.3041506Z %117 = tt.broadcast %116 : tensor<1x1x512xi32, #blocked> -> tensor<1x128x512xi32, #blocked> 2026-02-21T11:00:45.3041717Z %118 = arith.addi %69, %117 : tensor<1x128x512xi32, #blocked> 2026-02-21T11:00:45.3041928Z %119 = tt.addptr %70, %118 : tensor<1x128x512x!tt.ptr, #blocked>, tensor<1x128x512xi32, #blocked> 2026-02-21T11:00:45.3042167Z %120 = tt.load %119 : tensor<1x128x512x!tt.ptr, #blocked> 2026-02-21T11:00:45.3042372Z %121 = tt.reshape %120 : tensor<1x128x512xbf16, #blocked> -> tensor<128x512xbf16, #blocked4> 2026-02-21T11:00:45.3042729Z %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> 
2026-02-21T11:00:45.3043089Z %123 = ttg.convert_layout %121 : tensor<128x512xbf16, #blocked4> -> tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> 2026-02-21T11:00:45.3043414Z %124 = ttg.convert_layout %cst_9 : tensor<2x512xf32, #blocked4> -> tensor<2x512xf32, #blocked10> 2026-02-21T11:00:45.3043837Z %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x512xf32, #blocked10> 2026-02-21T11:00:45.3044247Z %126 = ttg.convert_layout %125 : tensor<2x512xf32, #blocked10> -> tensor<2x512xf32, #blocked4> 2026-02-21T11:00:45.3044508Z %127 = tt.reshape %126 : tensor<2x512xf32, #blocked4> -> tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3044769Z %128 = arith.truncf %127 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked> 2026-02-21T11:00:45.3045005Z %129 = arith.extf %128 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3045194Z %130 = "tt.reduce"(%129) <{axis = 2 : i32}> ({ 2026-02-21T11:00:45.3045324Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T11:00:45.3045446Z %183 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T11:00:45.3045570Z tt.reduce.return %183 : f32 2026-02-21T11:00:45.3045754Z }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:00:45.3046042Z %131 = ttg.convert_layout %130 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3046314Z %132 = arith.truncf %131 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 2026-02-21T11:00:45.3046532Z %133 = arith.extf %132 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3046724Z %134 = arith.mulf %133, %cst_8 : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3046911Z %135 = arith.truncf %134 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 
2026-02-21T11:00:45.3047127Z %136 = arith.extf %135 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3047318Z %137 = arith.cmpf ogt, %arg5, %136 : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3047490Z %138 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3047675Z %139 = arith.ori %137, %138 : tensor<1x2xi1, #blocked3> 2026-02-21T11:00:45.3047867Z %140 = arith.select %139, %arg5, %136 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3048074Z %141 = arith.mulf %129, %cst_7 : tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3048276Z %142 = arith.truncf %141 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked> 2026-02-21T11:00:45.3048559Z %143 = ttg.convert_layout %140 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:00:45.3048893Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T11:00:45.3049187Z %145 = ttg.convert_layout %144 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T11:00:45.3049428Z %146 = arith.extf %142 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3049678Z %147 = tt.broadcast %145 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x512xf32, #blocked1> 2026-02-21T11:00:45.3049920Z %148 = ttg.convert_layout %147 : tensor<1x2x512xf32, #blocked1> -> tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3050151Z %149 = arith.subf %146, %148 : tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3050450Z %150 = tt.extern_elementwise %149 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2x512xf32, #blocked> 2026-02-21T11:00:45.3050736Z %151 = "tt.reduce"(%150) <{axis = 2 : i32}> ({ 2026-02-21T11:00:45.3050861Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T11:00:45.3050993Z %183 = arith.addf 
%arg8, %arg9 : f32 2026-02-21T11:00:45.3051113Z tt.reduce.return %183 : f32 2026-02-21T11:00:45.3051292Z }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:00:45.3051581Z %152 = ttg.convert_layout %151 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3051824Z %153 = arith.subf %arg5, %140 : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3052109Z %154 = tt.extern_elementwise %153 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3052409Z %155 = arith.mulf %arg6, %154 : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3052565Z %156 = arith.addf %155, %152 : tensor<1x2xf32, #blocked3> 2026-02-21T11:00:45.3052802Z %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:00:45.3053136Z %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T11:00:45.3053428Z %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T11:00:45.3053671Z %160 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T11:00:45.3053914Z %161 = ttg.convert_layout %160 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T11:00:45.3054135Z %162 = arith.mulf %arg7, %161 : tensor<1x2x128xf32, #blocked> 2026-02-21T11:00:45.3054394Z %163 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T11:00:45.3054739Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x512x1xi32, #blocked9> 2026-02-21T11:00:45.3055052Z %165 = ttg.convert_layout %164 : tensor<1x512x1xi32, 
#blocked9> -> tensor<1x512x1xi32, #blocked2> 2026-02-21T11:00:45.3055268Z %166 = arith.muli %165, %cst_6 : tensor<1x512x1xi32, #blocked2> 2026-02-21T11:00:45.3055443Z %167 = arith.addi %72, %166 : tensor<1x512x1xi32, #blocked2> 2026-02-21T11:00:45.3055674Z %168 = tt.broadcast %167 : tensor<1x512x1xi32, #blocked2> -> tensor<1x512x128xi32, #blocked2> 2026-02-21T11:00:45.3055935Z %169 = ttg.convert_layout %168 : tensor<1x512x128xi32, #blocked2> -> tensor<1x512x128xi32, #blocked> 2026-02-21T11:00:45.3056160Z %170 = arith.addi %169, %76 : tensor<1x512x128xi32, #blocked> 2026-02-21T11:00:45.3056377Z %171 = tt.addptr %77, %170 : tensor<1x512x128x!tt.ptr, #blocked>, tensor<1x512x128xi32, #blocked> 2026-02-21T11:00:45.3056601Z %172 = tt.load %171 : tensor<1x512x128x!tt.ptr, #blocked> 2026-02-21T11:00:45.3056810Z %173 = arith.truncf %150 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked> 2026-02-21T11:00:45.3057048Z %174 = tt.reshape %162 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4> 2026-02-21T11:00:45.3057286Z %175 = tt.reshape %173 : tensor<1x2x512xbf16, #blocked> -> tensor<2x512xbf16, #blocked4> 2026-02-21T11:00:45.3057530Z %176 = tt.reshape %172 : tensor<1x512x128xbf16, #blocked> -> tensor<512x128xbf16, #blocked4> 2026-02-21T11:00:45.3057840Z %177 = ttg.convert_layout %175 : tensor<2x512xbf16, #blocked4> -> tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>> 2026-02-21T11:00:45.3058224Z %178 = ttg.convert_layout %176 : tensor<512x128xbf16, #blocked4> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>> 2026-02-21T11:00:45.3058532Z %179 = ttg.convert_layout %174 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked11> 2026-02-21T11:00:45.3058950Z %180 = tt.dot %177, %178, %179, inputPrecision = tf32 : tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>> -> tensor<2x128xf32, #blocked11> 2026-02-21T11:00:45.3059377Z %181 
= ttg.convert_layout %180 : tensor<2x128xf32, #blocked11> -> tensor<2x128xf32, #blocked4> 2026-02-21T11:00:45.3059624Z %182 = tt.reshape %181 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked> 2026-02-21T11:00:45.3059899Z scf.yield %140, %156, %182 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked> 2026-02-21T11:00:45.3060157Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T11:00:45.3060441Z %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:00:45.3060772Z %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T11:00:45.3061069Z %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T11:00:45.3061311Z %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T11:00:45.3061550Z %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T11:00:45.3061763Z %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked> 2026-02-21T11:00:45.3061961Z %85 = arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked> 2026-02-21T11:00:45.3062148Z %86 = arith.muli %8, %c262144_i32 : i32 2026-02-21T11:00:45.3062368Z %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:00:45.3062683Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6> 2026-02-21T11:00:45.3062964Z %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3> 2026-02-21T11:00:45.3063240Z %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 
2026-02-21T11:00:45.3063569Z %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7> 2026-02-21T11:00:45.3063879Z %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T11:00:45.3064085Z %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1> 2026-02-21T11:00:45.3064250Z %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1> 2026-02-21T11:00:45.3064404Z %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1> 2026-02-21T11:00:45.3064642Z %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:00:45.3064969Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T11:00:45.3065256Z %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4> 2026-02-21T11:00:45.3065542Z %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:00:45.3065880Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8> 2026-02-21T11:00:45.3066186Z %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T11:00:45.3066462Z %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1> 2026-02-21T11:00:45.3066705Z %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked> 2026-02-21T11:00:45.3066954Z %104 = tt.broadcast %101 : tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked> 2026-02-21T11:00:45.3067157Z %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked> 2026-02-21T11:00:45.3067366Z %106 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 
2026-02-21T11:00:45.3067607Z %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi32, #blocked> 2026-02-21T11:00:45.3067820Z tt.store %107, %85 : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T11:00:45.3067964Z tt.return 2026-02-21T11:00:45.3068048Z } 2026-02-21T11:00:45.3068132Z } 2026-02-21T11:00:45.3068179Z 2026-02-21T11:00:45.3068214Z {-# 2026-02-21T11:00:45.3068302Z external_resources: { 2026-02-21T11:00:45.3068409Z mlir_reproducer: { 2026-02-21T11:00:45.3070654Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T11:00:45.3072948Z disable_threading: false, 2026-02-21T11:00:45.3073060Z verify_each: true 2026-02-21T11:00:45.3073159Z } 
2026-02-21T11:00:45.3073234Z } 2026-02-21T11:00:45.3073313Z #-} 2026-02-21T11:00:45.3073619Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T11:00:45.3074322Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T11:00:45.3074878Z [81s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T11:00:45.3075617Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 512], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T11:00:45.3076280Z Error: RuntimeError: PassManager::run failed 2026-02-21T11:00:45.3076457Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T11:01:22.7217953Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 2.5 configs/s 2026-02-21T11:01:22.7227683Z [119s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s]) 2026-02-21T11:01:22.7508858Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 - configs/s 2026-02-21T11:01:23.7567826Z [120s] Initial random population of 100, 5 starting points: 2026-02-21T11:01:23.7568291Z error=20 2026-02-21T11:01:23.7568513Z timeout=19 2026-02-21T11:01:23.7568725Z ok=61 2026-02-21T11:01:23.7569356Z min=2.2194 2026-02-21T11:01:23.7569563Z mid=14.6173 2026-02-21T11:01:23.7569765Z max=3143.3940 2026-02-21T11:01:23.7570019Z best={'block_sizes': [1, 256, 16], 2026-02-21T11:01:23.7570430Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:01:23.7570826Z 'l2_groupings': [8], 2026-02-21T11:01:23.7571101Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:01:23.7571438Z 'loop_orders': [[0, 1]], 2026-02-21T11:01:23.7571672Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:01:23.7571936Z 'num_sm_multiplier': 16, 2026-02-21T11:01:23.7572163Z 'num_stages': 4, 2026-02-21T11:01:23.7572355Z 'num_warps': 16, 2026-02-21T11:01:23.7572572Z 'pid_type': 'persistent_blocked', 2026-02-21T11:01:23.7572951Z 'range_flattens': [False, True], 2026-02-21T11:01:23.7573216Z 'range_multi_buffers': [None, False], 2026-02-21T11:01:23.7573486Z 'range_num_stages': [1, 3], 2026-02-21T11:01:23.7573721Z 'range_unroll_factors': [2, 3], 2026-02-21T11:01:23.7573974Z 'range_warp_specializes': [], 2026-02-21T11:01:23.7574216Z 'waves_per_eu': 1} 2026-02-21T11:01:23.7584301Z [120s] Fitting surrogate: 100 points, 100 targets 2026-02-21T11:01:24.5109528Z [120s] Generation 1 starting: 77 neighbors, 5 active search path(s) 2026-02-21T11:02:01.9538464Z [158s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], 
loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:02:01.9559233Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.6 configs/s 2026-02-21T11:02:09.4118078Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 10.9 configs/s 2026-02-21T11:02:09.5643567Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 115/115 2103.5 2026-02-21T11:02:09.5643929Z configs/s 2026-02-21T11:02:15.2693629Z [171s] Generation 1 complete: 2026-02-21T11:02:15.2694065Z error=2 2026-02-21T11:02:15.2694287Z timeout=1 2026-02-21T11:02:15.2694496Z ok=80 2026-02-21T11:02:15.2694705Z min=1.7978 2026-02-21T11:02:15.2694909Z mid=3.4986 2026-02-21T11:02:15.2695147Z max=20.8857 2026-02-21T11:02:15.2695427Z best={'block_sizes': [1, 64, 16], 2026-02-21T11:02:15.2695839Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:02:15.2696352Z 'l2_groupings': [4], 2026-02-21T11:02:15.2696632Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:02:15.2696953Z 'loop_orders': [[0, 1]], 2026-02-21T11:02:15.2697241Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:02:15.2697511Z 'num_stages': 2, 2026-02-21T11:02:15.2697739Z 'num_warps': 4, 2026-02-21T11:02:15.2697974Z 'pid_type': 'flat', 2026-02-21T11:02:15.2698239Z 'range_flattens': [None, False], 2026-02-21T11:02:15.2698557Z 'range_multi_buffers': [None, True], 2026-02-21T11:02:15.2698865Z 'range_num_stages': [0, 1], 2026-02-21T11:02:15.2699144Z 'range_unroll_factors': [0, 4], 2026-02-21T11:02:15.2699442Z 'range_warp_specializes': [], 2026-02-21T11:02:15.2699717Z 'waves_per_eu': 4} 2026-02-21T11:02:15.2723165Z [171s] Fitting surrogate: 183 points, 183 targets 2026-02-21T11:02:16.0794248Z [172s] Generation 2 starting: 79 neighbors, 5 active search path(s) 2026-02-21T11:02:48.3935689Z [204s] Timeout 
after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:02:48.3948830Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.7 configs/s 2026-02-21T11:02:55.4598086Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 11.6 configs/s 2026-02-21T11:02:56.1408772Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 129/129 69.1 configs/s 2026-02-21T11:03:02.9796517Z [219s] Generation 2 complete: 2026-02-21T11:03:02.9796874Z error=3 2026-02-21T11:03:02.9797030Z timeout=1 2026-02-21T11:03:02.9797215Z ok=80 2026-02-21T11:03:02.9797366Z min=1.6125 2026-02-21T11:03:02.9797507Z mid=3.4795 2026-02-21T11:03:02.9797657Z max=19.0848 2026-02-21T11:03:02.9797851Z best={'block_sizes': [1, 64, 16], 2026-02-21T11:03:02.9798163Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:03:02.9798461Z 'l2_groupings': [4], 2026-02-21T11:03:02.9799023Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:03:02.9799262Z 'loop_orders': [[0, 1]], 2026-02-21T11:03:02.9799461Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:03:02.9799656Z 'num_stages': 1, 2026-02-21T11:03:02.9799832Z 'num_warps': 4, 2026-02-21T11:03:02.9800057Z 'pid_type': 'flat', 2026-02-21T11:03:02.9800241Z 'range_flattens': [None, False], 2026-02-21T11:03:02.9800459Z 'range_multi_buffers': [None, None], 2026-02-21T11:03:02.9800674Z 'range_num_stages': [0, 1], 2026-02-21T11:03:02.9800871Z 'range_unroll_factors': [0, 4], 2026-02-21T11:03:02.9801089Z 'range_warp_specializes': [], 2026-02-21T11:03:02.9801285Z 'waves_per_eu': 4} 2026-02-21T11:03:02.9845051Z [219s] Fitting 
surrogate: 267 points, 267 targets 2026-02-21T11:03:03.8130008Z [220s] Generation 3 starting: 79 neighbors, 5 active search path(s) 2026-02-21T11:03:21.8827993Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 1.5 configs/s 2026-02-21T11:03:27.8371828Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 13.6 configs/s 2026-02-21T11:03:33.6406421Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 129/129 13.4 configs/s 2026-02-21T11:03:39.8989175Z [256s] Generation 3 complete: 2026-02-21T11:03:39.8991481Z error=6 2026-02-21T11:03:39.8991820Z ok=78 2026-02-21T11:03:39.8992037Z min=1.5153 2026-02-21T11:03:39.8992252Z mid=2.1004 2026-02-21T11:03:39.8992449Z max=22.4364 2026-02-21T11:03:39.8992688Z best={'block_sizes': [1, 64, 64], 2026-02-21T11:03:39.8993126Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:03:39.8993536Z 'l2_groupings': [4], 2026-02-21T11:03:39.8993820Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:03:39.8994150Z 'loop_orders': [[0, 1]], 2026-02-21T11:03:39.8994427Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:03:39.8994715Z 'num_stages': 1, 2026-02-21T11:03:39.8994946Z 'num_warps': 4, 2026-02-21T11:03:39.8995176Z 'pid_type': 'flat', 2026-02-21T11:03:39.8995404Z 'range_flattens': [None, False], 2026-02-21T11:03:39.8995524Z 'range_multi_buffers': [None, None], 2026-02-21T11:03:39.8995640Z 'range_num_stages': [0, 1], 2026-02-21T11:03:39.8995744Z 'range_unroll_factors': [0, 4], 2026-02-21T11:03:39.8995859Z 'range_warp_specializes': [], 2026-02-21T11:03:39.8995963Z 'waves_per_eu': 4} 2026-02-21T11:03:39.9066071Z [256s] Fitting surrogate: 351 points, 351 targets 2026-02-21T11:03:40.7297183Z [257s] Generation 4 starting: 73 neighbors, 5 active search path(s) 2026-02-21T11:04:13.0906899Z [289s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 
1]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:04:13.0921397Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.9 configs/s 2026-02-21T11:04:19.4633426Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 11.5 configs/s 2026-02-21T11:04:23.5479363Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 131/131 19.0 configs/s 2026-02-21T11:04:31.0596406Z [307s] Generation 4 complete: 2026-02-21T11:04:31.0598917Z error=3 2026-02-21T11:04:31.0599283Z timeout=1 2026-02-21T11:04:31.0599380Z ok=74 2026-02-21T11:04:31.0599460Z min=1.3760 2026-02-21T11:04:31.0599543Z mid=2.0470 2026-02-21T11:04:31.0599623Z max=18.5830 2026-02-21T11:04:31.0599721Z best={'block_sizes': [1, 64, 64], 2026-02-21T11:04:31.0599913Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:04:31.0600084Z 'l2_groupings': [4], 2026-02-21T11:04:31.0600187Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:04:31.0601569Z 'loop_orders': [[0, 1]], 2026-02-21T11:04:31.0601677Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:04:31.0601784Z 'num_stages': 1, 2026-02-21T11:04:31.0601875Z 'num_warps': 4, 2026-02-21T11:04:31.0602241Z 'pid_type': 'flat', 2026-02-21T11:04:31.0602348Z 'range_flattens': [None, False], 2026-02-21T11:04:31.0602460Z 'range_multi_buffers': [None, None], 2026-02-21T11:04:31.0602631Z 'range_num_stages': [0, 1], 2026-02-21T11:04:31.0602742Z 'range_unroll_factors': [0, 4], 2026-02-21T11:04:31.0602852Z 'range_warp_specializes': [], 2026-02-21T11:04:31.0602957Z 'waves_per_eu': 4} 2026-02-21T11:04:31.0682816Z [307s] Fitting surrogate: 429 points, 429 targets 2026-02-21T11:04:31.7913124Z [308s] Generation 5 starting: 57 neighbors, 4 active search path(s) 2026-02-21T11:05:03.5082754Z [339s] Timeout after 30s 
compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:05:04.4310195Z [340s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:05:04.4328957Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 0.7 configs/s 2026-02-21T11:05:08.4304826Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 14.4 configs/s 2026-02-21T11:05:10.1507248Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 145/145 40.4 configs/s 2026-02-21T11:05:16.8629954Z [353s] Generation 5 complete: 2026-02-21T11:05:16.8630379Z error=3 2026-02-21T11:05:16.8632923Z timeout=2 2026-02-21T11:05:16.8633231Z ok=56 2026-02-21T11:05:16.8633433Z min=1.4232 2026-02-21T11:05:16.8633639Z mid=1.8395 2026-02-21T11:05:16.8633841Z max=22.4022 2026-02-21T11:05:16.8634077Z best={'block_sizes': [1, 64, 64], 2026-02-21T11:05:16.8634451Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:05:16.8635269Z 'l2_groupings': [4], 2026-02-21T11:05:16.8635486Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:05:16.8635733Z 'loop_orders': [[0, 1]], 2026-02-21T11:05:16.8635948Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:05:16.8636161Z 'num_stages': 1, 
2026-02-21T11:05:16.8636339Z 'num_warps': 4, 2026-02-21T11:05:16.8636557Z 'pid_type': 'flat', 2026-02-21T11:05:16.8636760Z 'range_flattens': [None, False], 2026-02-21T11:05:16.8637004Z 'range_multi_buffers': [None, None], 2026-02-21T11:05:16.8637394Z 'range_num_stages': [0, 1], 2026-02-21T11:05:16.8637608Z 'range_unroll_factors': [0, 4], 2026-02-21T11:05:16.8637837Z 'range_warp_specializes': [], 2026-02-21T11:05:16.8638058Z 'waves_per_eu': 4} 2026-02-21T11:05:16.8676229Z [353s] Fitting surrogate: 490 points, 490 targets 2026-02-21T11:05:17.4756648Z [353s] Generation 6 starting: 52 neighbors, 3 active search path(s) 2026-02-21T11:05:50.3436903Z [386s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:05:50.3460168Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 0.7 configs/s 2026-02-21T11:05:54.1070619Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 13.9 configs/s 2026-02-21T11:05:56.3837781Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 145/145 31.7 configs/s 2026-02-21T11:06:02.4587875Z [398s] Generation 6 complete: 2026-02-21T11:06:02.4588279Z error=3 2026-02-21T11:06:02.4588477Z timeout=1 2026-02-21T11:06:02.4588662Z ok=52 2026-02-21T11:06:02.4588840Z min=1.3725 2026-02-21T11:06:02.4589029Z mid=1.9183 2026-02-21T11:06:02.4589229Z max=18.5029 2026-02-21T11:06:02.4589447Z best={'block_sizes': [1, 64, 64], 2026-02-21T11:06:02.4589820Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:06:02.4590199Z 'l2_groupings': [4], 2026-02-21T11:06:02.4590454Z 'load_eviction_policies': 
['', '', ''], 2026-02-21T11:06:02.4590740Z 'loop_orders': [[0, 1]], 2026-02-21T11:06:02.4591163Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:06:02.4591405Z 'num_stages': 1, 2026-02-21T11:06:02.4591618Z 'num_warps': 4, 2026-02-21T11:06:02.4591853Z 'pid_type': 'flat', 2026-02-21T11:06:02.4592097Z 'range_flattens': [None, False], 2026-02-21T11:06:02.4592372Z 'range_multi_buffers': [None, None], 2026-02-21T11:06:02.4592654Z 'range_num_stages': [0, 1], 2026-02-21T11:06:02.4592921Z 'range_unroll_factors': [0, 4], 2026-02-21T11:06:02.4593191Z 'range_warp_specializes': [], 2026-02-21T11:06:02.4593449Z 'waves_per_eu': 4} 2026-02-21T11:06:02.4640379Z [398s] Fitting surrogate: 546 points, 546 targets 2026-02-21T11:06:03.6799416Z [400s] Generation 7 starting: 51 neighbors, 3 active search path(s) 2026-02-21T11:06:38.2377137Z [434s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:06:38.2396480Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.8 configs/s 2026-02-21T11:06:41.5318491Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 15.7 configs/s 2026-02-21T11:06:41.7265896Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━ 145/145 1016.5 2026-02-21T11:06:41.7269878Z configs/s 2026-02-21T11:06:47.8761539Z [444s] Generation 7 complete: 2026-02-21T11:06:47.8761810Z error=7 2026-02-21T11:06:47.8761894Z timeout=1 2026-02-21T11:06:47.8761978Z ok=47 2026-02-21T11:06:47.8762053Z min=1.3521 2026-02-21T11:06:47.8762135Z mid=1.9459 2026-02-21T11:06:47.8762209Z max=9.9115 2026-02-21T11:06:47.8762297Z best={'block_sizes': [1, 
128, 64], 2026-02-21T11:06:47.8762483Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:06:47.8762720Z 'l2_groupings': [4], 2026-02-21T11:06:47.8762826Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:06:47.8763269Z 'loop_orders': [[0, 1]], 2026-02-21T11:06:47.8763373Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:06:47.8763476Z 'num_stages': 3, 2026-02-21T11:06:47.8763572Z 'num_warps': 4, 2026-02-21T11:06:47.8766442Z 'pid_type': 'flat', 2026-02-21T11:06:47.8766663Z 'range_flattens': [None, None], 2026-02-21T11:06:47.8766829Z 'range_multi_buffers': [None, None], 2026-02-21T11:06:47.8766964Z 'range_num_stages': [0, 1], 2026-02-21T11:06:47.8767088Z 'range_unroll_factors': [0, 4], 2026-02-21T11:06:47.8767208Z 'range_warp_specializes': [], 2026-02-21T11:06:47.8767320Z 'waves_per_eu': 2} 2026-02-21T11:06:47.8823622Z [444s] Fitting surrogate: 601 points, 601 targets 2026-02-21T11:06:48.3618929Z [444s] Generation 8 starting: 37 neighbors, 2 active search path(s) 2026-02-21T11:07:20.3064509Z [476s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:07:24.1572503Z [480s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 
2026-02-21T11:07:24.1592501Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.9 configs/s 2026-02-21T11:07:26.5222693Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 15.9 configs/s 2026-02-21T11:07:27.8311182Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 147/147 38.7 configs/s 2026-02-21T11:07:33.0606900Z [489s] Generation 8 complete: 2026-02-21T11:07:33.0607337Z error=3 2026-02-21T11:07:33.0607699Z timeout=2 2026-02-21T11:07:33.0608048Z ok=35 2026-02-21T11:07:33.0608390Z min=1.4081 2026-02-21T11:07:33.0608602Z mid=1.6308 2026-02-21T11:07:33.0608802Z max=6.2155 2026-02-21T11:07:33.0609035Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:07:33.0609467Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:07:33.0609904Z 'l2_groupings': [4], 2026-02-21T11:07:33.0610191Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:07:33.0610524Z 'loop_orders': [[0, 1]], 2026-02-21T11:07:33.0610780Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:07:33.0611024Z 'num_stages': 3, 2026-02-21T11:07:33.0611238Z 'num_warps': 4, 2026-02-21T11:07:33.0611459Z 'pid_type': 'flat', 2026-02-21T11:07:33.0611704Z 'range_flattens': [None, None], 2026-02-21T11:07:33.0611973Z 'range_multi_buffers': [None, None], 2026-02-21T11:07:33.0612893Z 'range_num_stages': [0, 1], 2026-02-21T11:07:33.0613149Z 'range_unroll_factors': [0, 4], 2026-02-21T11:07:33.0613412Z 'range_warp_specializes': [], 2026-02-21T11:07:33.0613699Z 'waves_per_eu': 2} 2026-02-21T11:07:33.0658400Z [489s] Fitting surrogate: 641 points, 641 targets 2026-02-21T11:07:33.5645699Z [489s] Generation 9 starting: 39 neighbors, 2 active search path(s) 2026-02-21T11:08:06.7941657Z [523s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=4, 
pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:08:09.3092780Z [525s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:08:09.8105734Z [526s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:08:09.8128102Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 1.0 configs/s 2026-02-21T11:08:12.8933018Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 12.8 configs/s 2026-02-21T11:08:13.0865179Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━ 147/147 1117.7 2026-02-21T11:08:13.0869252Z configs/s 2026-02-21T11:08:18.9014017Z [535s] Generation 9 complete: 2026-02-21T11:08:18.9014303Z error=4 2026-02-21T11:08:18.9014407Z timeout=3 2026-02-21T11:08:18.9014507Z ok=35 2026-02-21T11:08:18.9014646Z min=1.3740 2026-02-21T11:08:18.9014747Z mid=1.6504 2026-02-21T11:08:18.9014835Z max=6.6216 2026-02-21T11:08:18.9014956Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:08:18.9015213Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 
2026-02-21T11:08:18.9015405Z 'l2_groupings': [4], 2026-02-21T11:08:18.9015537Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:08:18.9015695Z 'loop_orders': [[0, 1]], 2026-02-21T11:08:18.9015836Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:08:18.9015976Z 'num_stages': 3, 2026-02-21T11:08:18.9016087Z 'num_warps': 4, 2026-02-21T11:08:18.9016219Z 'pid_type': 'flat', 2026-02-21T11:08:18.9016345Z 'range_flattens': [None, None], 2026-02-21T11:08:18.9016498Z 'range_multi_buffers': [None, None], 2026-02-21T11:08:18.9016646Z 'range_num_stages': [0, 1], 2026-02-21T11:08:18.9016768Z 'range_unroll_factors': [0, 4], 2026-02-21T11:08:18.9016914Z 'range_warp_specializes': [], 2026-02-21T11:08:18.9017041Z 'waves_per_eu': 2} 2026-02-21T11:08:18.9060030Z [535s] Fitting surrogate: 683 points, 683 targets 2026-02-21T11:08:19.3837491Z [535s] Generation 10 starting: 38 neighbors, 2 active search path(s) 2026-02-21T11:08:52.2365533Z [568s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:08:52.8284371Z [569s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:08:55.2354206Z [571s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], 
indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:08:55.2368411Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 1.1 configs/s 2026-02-21T11:08:57.7107490Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 15.6 configs/s 2026-02-21T11:08:59.1580440Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 147/147 39.3 configs/s 2026-02-21T11:09:04.4554510Z [580s] Generation 10 complete: 2026-02-21T11:09:04.4554995Z error=4 2026-02-21T11:09:04.4555214Z timeout=3 2026-02-21T11:09:04.4555423Z ok=34 2026-02-21T11:09:04.4555631Z min=1.4854 2026-02-21T11:09:04.4555841Z mid=1.6330 2026-02-21T11:09:04.4556039Z max=26.4248 2026-02-21T11:09:04.4556279Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:09:04.4556702Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:09:04.4557720Z 'l2_groupings': [4], 2026-02-21T11:09:04.4558001Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:09:04.4558345Z 'loop_orders': [[0, 1]], 2026-02-21T11:09:04.4558629Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:09:04.4558904Z 'num_stages': 3, 2026-02-21T11:09:04.4559135Z 'num_warps': 4, 2026-02-21T11:09:04.4559375Z 'pid_type': 'flat', 2026-02-21T11:09:04.4559656Z 'range_flattens': [None, None], 2026-02-21T11:09:04.4559959Z 'range_multi_buffers': [None, None], 2026-02-21T11:09:04.4560275Z 'range_num_stages': [0, 1], 2026-02-21T11:09:04.4560558Z 'range_unroll_factors': [0, 4], 2026-02-21T11:09:04.4560858Z 'range_warp_specializes': [], 2026-02-21T11:09:04.4561135Z 'waves_per_eu': 2} 2026-02-21T11:09:04.4627339Z [580s] Fitting surrogate: 724 points, 724 targets 
2026-02-21T11:09:04.9131000Z [581s] Generation 11 starting: 41 neighbors, 2 active search path(s) 2026-02-21T11:09:36.4360246Z [612s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:09:38.1870911Z [614s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:09:40.9085631Z [617s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:09:41.6874195Z [618s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, None], 
range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:09:41.6889230Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 1.1 configs/s 2026-02-21T11:09:44.3449669Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 41/41 15.7 configs/s 2026-02-21T11:09:44.4798149Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 147/147 1633.3 2026-02-21T11:09:44.4802226Z configs/s 2026-02-21T11:09:48.8052195Z [625s] Generation 11 complete: 2026-02-21T11:09:48.8052572Z error=5 2026-02-21T11:09:48.8053215Z timeout=4 2026-02-21T11:09:48.8053428Z ok=35 2026-02-21T11:09:48.8053632Z min=1.3459 2026-02-21T11:09:48.8053846Z mid=2.0912 2026-02-21T11:09:48.8054050Z max=8.5581 2026-02-21T11:09:48.8054285Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:09:48.8054729Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:09:48.8055128Z 'l2_groupings': [4], 2026-02-21T11:09:48.8055450Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:09:48.8055751Z 'loop_orders': [[0, 1]], 2026-02-21T11:09:48.8056030Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:09:48.8056289Z 'num_stages': 3, 2026-02-21T11:09:48.8056516Z 'num_warps': 4, 2026-02-21T11:09:48.8056917Z 'pid_type': 'flat', 2026-02-21T11:09:48.8057174Z 'range_flattens': [None, None], 2026-02-21T11:09:48.8057485Z 'range_multi_buffers': [None, None], 2026-02-21T11:09:48.8057781Z 'range_num_stages': [0, 1], 2026-02-21T11:09:48.8058052Z 'range_unroll_factors': [0, 4], 2026-02-21T11:09:48.8058337Z 'range_warp_specializes': [], 2026-02-21T11:09:48.8058608Z 'waves_per_eu': 2} 2026-02-21T11:09:48.8091620Z [625s] Fitting surrogate: 768 points, 768 targets 2026-02-21T11:09:49.2550956Z [625s] Generation 12 starting: 37 neighbors, 2 active search path(s) 2026-02-21T11:10:21.7341762Z [658s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], 
load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:10:22.3214801Z [658s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:10:24.1114793Z [660s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:10:24.1131417Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 1.1 configs/s 2026-02-21T11:10:26.6452782Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 14.8 configs/s 2026-02-21T11:10:26.7980533Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━ 148/148 1386.2 2026-02-21T11:10:26.7984005Z configs/s 2026-02-21T11:10:31.6333489Z [668s] Generation 12 complete: 2026-02-21T11:10:31.6333705Z error=2 2026-02-21T11:10:31.6333788Z timeout=3 2026-02-21T11:10:31.6333928Z ok=35 2026-02-21T11:10:31.6334004Z min=1.3744 2026-02-21T11:10:31.6334089Z mid=1.5638 2026-02-21T11:10:31.6334174Z max=13.1922 2026-02-21T11:10:31.6334272Z 
best={'block_sizes': [1, 128, 64], 2026-02-21T11:10:31.6334437Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:10:31.6334593Z 'l2_groupings': [4], 2026-02-21T11:10:31.6334705Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:10:31.6334840Z 'loop_orders': [[0, 1]], 2026-02-21T11:10:31.6334951Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:10:31.6335063Z 'num_stages': 3, 2026-02-21T11:10:31.6335159Z 'num_warps': 4, 2026-02-21T11:10:31.6335249Z 'pid_type': 'flat', 2026-02-21T11:10:31.6335357Z 'range_flattens': [None, None], 2026-02-21T11:10:31.6335914Z 'range_multi_buffers': [None, None], 2026-02-21T11:10:31.6336030Z 'range_num_stages': [0, 1], 2026-02-21T11:10:31.6336141Z 'range_unroll_factors': [0, 4], 2026-02-21T11:10:31.6336257Z 'range_warp_specializes': [], 2026-02-21T11:10:31.6336367Z 'waves_per_eu': 2} 2026-02-21T11:10:31.6396255Z [668s] Fitting surrogate: 808 points, 808 targets 2026-02-21T11:10:32.1045775Z [668s] Generation 13 starting: 40 neighbors, 2 active search path(s) 2026-02-21T11:11:06.2862938Z [702s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:11:08.4065608Z [704s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[4, 4], range_unroll_factors=[3, 3], 
range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:11:08.4088932Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 1.3 configs/s 2026-02-21T11:11:11.1514844Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 40/40 14.8 configs/s 2026-02-21T11:11:11.5335271Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━ 148/148 171.4 configs/s 2026-02-21T11:11:17.2540730Z [713s] Generation 13 complete: 2026-02-21T11:11:17.2540986Z error=3 2026-02-21T11:11:17.2541089Z timeout=2 2026-02-21T11:11:17.2541172Z ok=38 2026-02-21T11:11:17.2541253Z min=1.2978 2026-02-21T11:11:17.2541338Z mid=1.6165 2026-02-21T11:11:17.2541433Z max=11.5613 2026-02-21T11:11:17.2541532Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:11:17.2541692Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:11:17.2541850Z 'l2_groupings': [4], 2026-02-21T11:11:17.2541956Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:11:17.2542656Z 'loop_orders': [[0, 1]], 2026-02-21T11:11:17.2542796Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:11:17.2542907Z 'num_stages': 3, 2026-02-21T11:11:17.2543004Z 'num_warps': 4, 2026-02-21T11:11:17.2543093Z 'pid_type': 'flat', 2026-02-21T11:11:17.2543198Z 'range_flattens': [None, None], 2026-02-21T11:11:17.2543314Z 'range_multi_buffers': [None, None], 2026-02-21T11:11:17.2543444Z 'range_num_stages': [0, 1], 2026-02-21T11:11:17.2543550Z 'range_unroll_factors': [0, 4], 2026-02-21T11:11:17.2543764Z 'range_warp_specializes': [], 2026-02-21T11:11:17.2543874Z 'waves_per_eu': 2} 2026-02-21T11:11:17.2606023Z [713s] Fitting surrogate: 851 points, 851 targets 2026-02-21T11:11:17.7351644Z [714s] Generation 14 starting: 37 neighbors, 2 active search path(s) 2026-02-21T11:11:49.1114612Z [745s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, 
num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:11:51.2721583Z [747s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[None, None], range_num_stages=[4, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:11:51.9902690Z [748s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:11:51.9922260Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 1.1 configs/s 2026-02-21T11:11:54.4943594Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 15.2 configs/s 2026-02-21T11:11:54.6420388Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1466.7 2026-02-21T11:11:54.6423624Z configs/s 2026-02-21T11:11:59.4695959Z [755s] Generation 14 complete: 2026-02-21T11:11:59.4696367Z error=3 2026-02-21T11:11:59.4696574Z timeout=3 2026-02-21T11:11:59.4696780Z ok=34 2026-02-21T11:11:59.4696975Z min=1.3496 2026-02-21T11:11:59.4697185Z mid=1.5922 2026-02-21T11:11:59.4697388Z max=11.9982 2026-02-21T11:11:59.4697623Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:11:59.4698065Z 'indexing': ['pointer', 'block_ptr', 
'pointer', 'block_ptr'], 2026-02-21T11:11:59.4698473Z 'l2_groupings': [4], 2026-02-21T11:11:59.4698762Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:11:59.4699082Z 'loop_orders': [[0, 1]], 2026-02-21T11:11:59.4699365Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:11:59.4699634Z 'num_stages': 3, 2026-02-21T11:11:59.4699881Z 'num_warps': 4, 2026-02-21T11:11:59.4700111Z 'pid_type': 'flat', 2026-02-21T11:11:59.4700375Z 'range_flattens': [None, None], 2026-02-21T11:11:59.4700686Z 'range_multi_buffers': [None, None], 2026-02-21T11:11:59.4700957Z 'range_num_stages': [0, 1], 2026-02-21T11:11:59.4701205Z 'range_unroll_factors': [0, 4], 2026-02-21T11:11:59.4701470Z 'range_warp_specializes': [], 2026-02-21T11:11:59.4701724Z 'waves_per_eu': 2} 2026-02-21T11:11:59.4744013Z [755s] Fitting surrogate: 891 points, 891 targets 2026-02-21T11:11:59.9976189Z [756s] Generation 15 starting: 41 neighbors, 2 active search path(s) 2026-02-21T11:12:32.6529140Z [789s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:12:32.8436577Z [789s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:12:33.0215778Z [789s] Timeout after 30s compiling 
Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[2, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:12:34.3421795Z [790s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:12:34.5693137Z [790s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:12:35.0233572Z [791s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:12:35.0254647Z Generation 15: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 1.3 configs/s 2026-02-21T11:12:36.9964820Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 41/41 21.4 configs/s 2026-02-21T11:12:37.1443741Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1499.0 2026-02-21T11:12:37.1447361Z configs/s 2026-02-21T11:12:41.9987658Z [798s] Generation 15 complete: 2026-02-21T11:12:41.9988097Z error=10 2026-02-21T11:12:41.9988307Z timeout=6 2026-02-21T11:12:41.9988512Z ok=28 2026-02-21T11:12:41.9988733Z min=1.3623 2026-02-21T11:12:41.9988932Z mid=1.5670 2026-02-21T11:12:41.9989134Z max=11.7756 2026-02-21T11:12:41.9989367Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:12:41.9989804Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:12:41.9990208Z 'l2_groupings': [4], 2026-02-21T11:12:41.9990487Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:12:41.9990801Z 'loop_orders': [[0, 1]], 2026-02-21T11:12:41.9991083Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:12:41.9992086Z 'num_stages': 3, 2026-02-21T11:12:41.9992318Z 'num_warps': 4, 2026-02-21T11:12:41.9992576Z 'pid_type': 'flat', 2026-02-21T11:12:41.9992832Z 'range_flattens': [None, None], 2026-02-21T11:12:41.9993136Z 'range_multi_buffers': [None, None], 2026-02-21T11:12:41.9993442Z 'range_num_stages': [0, 1], 2026-02-21T11:12:41.9993718Z 'range_unroll_factors': [0, 4], 2026-02-21T11:12:41.9994018Z 'range_warp_specializes': [], 2026-02-21T11:12:41.9994301Z 'waves_per_eu': 2} 2026-02-21T11:12:42.0035207Z [798s] Fitting surrogate: 935 points, 935 targets 2026-02-21T11:12:42.4793653Z [798s] Generation 16 starting: 37 neighbors, 2 active search path(s) 2026-02-21T11:13:13.9223528Z [830s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', 
range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:13:16.1248794Z [832s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:13:16.1266460Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 1.2 configs/s 2026-02-21T11:13:18.7862298Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 14.1 configs/s 2026-02-21T11:13:20.0479952Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 154/154 41.2 configs/s 2026-02-21T11:13:25.2452005Z [841s] Generation 16 complete: 2026-02-21T11:13:25.2452443Z error=1 2026-02-21T11:13:25.2452799Z timeout=2 2026-02-21T11:13:25.2453096Z ok=37 2026-02-21T11:13:25.2453331Z min=1.3121 2026-02-21T11:13:25.2453679Z mid=1.5823 2026-02-21T11:13:25.2453902Z max=8.6099 2026-02-21T11:13:25.2454210Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:13:25.2455384Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:13:25.2455841Z 'l2_groupings': [4], 2026-02-21T11:13:25.2456135Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:13:25.2456470Z 'loop_orders': [[0, 1]], 2026-02-21T11:13:25.2456762Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:13:25.2457029Z 'num_stages': 3, 2026-02-21T11:13:25.2457269Z 'num_warps': 4, 2026-02-21T11:13:25.2457516Z 'pid_type': 'flat', 2026-02-21T11:13:25.2457788Z 'range_flattens': [None, None], 2026-02-21T11:13:25.2458095Z 'range_multi_buffers': [None, None], 2026-02-21T11:13:25.2458404Z 'range_num_stages': [0, 1], 
2026-02-21T11:13:25.2458704Z 'range_unroll_factors': [0, 4], 2026-02-21T11:13:25.2458998Z 'range_warp_specializes': [], 2026-02-21T11:13:25.2459277Z 'waves_per_eu': 2} 2026-02-21T11:13:25.2513058Z [841s] Fitting surrogate: 975 points, 975 targets 2026-02-21T11:13:25.7328589Z [842s] Generation 17 starting: 38 neighbors, 2 active search path(s) 2026-02-21T11:14:00.7923925Z [877s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:14:04.0228500Z [880s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:14:04.0252234Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 1.1 configs/s 2026-02-21T11:14:06.7681413Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 14.0 configs/s 2026-02-21T11:14:06.9332343Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1329.3 2026-02-21T11:14:06.9335828Z configs/s 2026-02-21T11:14:12.3437838Z [888s] Generation 17 complete: 2026-02-21T11:14:12.3438257Z error=3 2026-02-21T11:14:12.3438467Z timeout=2 2026-02-21T11:14:12.3438672Z ok=36 2026-02-21T11:14:12.3438874Z min=1.3965 2026-02-21T11:14:12.3439116Z mid=1.6284 2026-02-21T11:14:12.3439324Z max=11.5837 
2026-02-21T11:14:12.3439580Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:14:12.3440009Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:14:12.3440413Z 'l2_groupings': [4], 2026-02-21T11:14:12.3440694Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:14:12.3441015Z 'loop_orders': [[0, 1]], 2026-02-21T11:14:12.3441302Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:14:12.3441733Z 'num_stages': 3, 2026-02-21T11:14:12.3441972Z 'num_warps': 4, 2026-02-21T11:14:12.3442206Z 'pid_type': 'flat', 2026-02-21T11:14:12.3442491Z 'range_flattens': [None, None], 2026-02-21T11:14:12.3442912Z 'range_multi_buffers': [None, None], 2026-02-21T11:14:12.3443218Z 'range_num_stages': [0, 1], 2026-02-21T11:14:12.3443508Z 'range_unroll_factors': [0, 4], 2026-02-21T11:14:12.3443810Z 'range_warp_specializes': [], 2026-02-21T11:14:12.3444092Z 'waves_per_eu': 2} 2026-02-21T11:14:12.3511113Z [888s] Fitting surrogate: 1016 points, 1016 targets 2026-02-21T11:14:12.8333573Z [889s] Generation 18 starting: 39 neighbors, 2 active search path(s) 2026-02-21T11:14:45.9809781Z [922s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:14:45.9830061Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 1.2 configs/s 2026-02-21T11:14:48.9256532Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 13.4 configs/s 2026-02-21T11:14:50.1775665Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 154/154 53.4 configs/s 2026-02-21T11:14:55.8575378Z [932s] Generation 18 complete: 2026-02-21T11:14:55.8575804Z error=1 
2026-02-21T11:14:55.8576024Z timeout=1 2026-02-21T11:14:55.8576282Z ok=40 2026-02-21T11:14:55.8576490Z min=1.3889 2026-02-21T11:14:55.8576693Z mid=1.6112 2026-02-21T11:14:55.8576899Z max=23.5224 2026-02-21T11:14:55.8577147Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:14:55.8577596Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:14:55.8578010Z 'l2_groupings': [4], 2026-02-21T11:14:55.8578306Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:14:55.8578632Z 'loop_orders': [[0, 1]], 2026-02-21T11:14:55.8578916Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:14:55.8579696Z 'num_stages': 3, 2026-02-21T11:14:55.8579920Z 'num_warps': 4, 2026-02-21T11:14:55.8580165Z 'pid_type': 'flat', 2026-02-21T11:14:55.8580425Z 'range_flattens': [None, None], 2026-02-21T11:14:55.8580774Z 'range_multi_buffers': [None, None], 2026-02-21T11:14:55.8581088Z 'range_num_stages': [0, 1], 2026-02-21T11:14:55.8581333Z 'range_unroll_factors': [0, 4], 2026-02-21T11:14:55.8581597Z 'range_warp_specializes': [], 2026-02-21T11:14:55.8581845Z 'waves_per_eu': 2} 2026-02-21T11:14:55.8645734Z [932s] Fitting surrogate: 1058 points, 1058 targets 2026-02-21T11:14:56.3091417Z [932s] Generation 19 starting: 34 neighbors, 2 active search path(s) 2026-02-21T11:15:29.3802160Z [965s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:15:31.8707484Z [968s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], 
matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:15:32.0876205Z [968s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:15:32.0894568Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 0.7 configs/s 2026-02-21T11:15:34.3455969Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 15.3 configs/s 2026-02-21T11:15:34.5112645Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1301.6 2026-02-21T11:15:34.5115643Z configs/s 2026-02-21T11:15:39.9258108Z [976s] Generation 19 complete: 2026-02-21T11:15:39.9258493Z error=1 2026-02-21T11:15:39.9258714Z timeout=3 2026-02-21T11:15:39.9258923Z ok=33 2026-02-21T11:15:39.9259124Z min=1.3823 2026-02-21T11:15:39.9259338Z mid=1.5900 2026-02-21T11:15:39.9259562Z max=8.6143 2026-02-21T11:15:39.9259833Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:15:39.9260254Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:15:39.9260695Z 'l2_groupings': [4], 2026-02-21T11:15:39.9260982Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:15:39.9261320Z 'loop_orders': [[0, 1]], 2026-02-21T11:15:39.9261637Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:15:39.9261940Z 'num_stages': 3, 2026-02-21T11:15:39.9262188Z 'num_warps': 4, 2026-02-21T11:15:39.9262425Z 'pid_type': 'flat', 2026-02-21T11:15:39.9262698Z 'range_flattens': [None, None], 
2026-02-21T11:15:39.9263008Z 'range_multi_buffers': [None, None], 2026-02-21T11:15:39.9263713Z 'range_num_stages': [0, 1], 2026-02-21T11:15:39.9263990Z 'range_unroll_factors': [0, 4], 2026-02-21T11:15:39.9264268Z 'range_warp_specializes': [], 2026-02-21T11:15:39.9264377Z 'waves_per_eu': 2} 2026-02-21T11:15:39.9342566Z [976s] Fitting surrogate: 1095 points, 1095 targets 2026-02-21T11:15:40.3601168Z [976s] Generation 20 starting: 30 neighbors, 2 active search path(s) 2026-02-21T11:16:04.1225009Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.5 configs/s 2026-02-21T11:16:06.4191955Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 13.4 configs/s 2026-02-21T11:16:06.5441065Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 154/154 - configs/s 2026-02-21T11:16:10.6239567Z [1006s] Generation 20 complete: 2026-02-21T11:16:10.6242735Z error=4 2026-02-21T11:16:10.6243042Z ok=29 2026-02-21T11:16:10.6243245Z min=1.3522 2026-02-21T11:16:10.6243451Z mid=1.5795 2026-02-21T11:16:10.6243649Z max=25.1943 2026-02-21T11:16:10.6243944Z best={'block_sizes': [1, 128, 64], 2026-02-21T11:16:10.6244400Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:16:10.6244800Z 'l2_groupings': [4], 2026-02-21T11:16:10.6245069Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:16:10.6245366Z 'loop_orders': [[0, 1]], 2026-02-21T11:16:10.6245638Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:16:10.6245902Z 'num_stages': 3, 2026-02-21T11:16:10.6246138Z 'num_warps': 4, 2026-02-21T11:16:10.6246367Z 'pid_type': 'flat', 2026-02-21T11:16:10.6246641Z 'range_flattens': [None, None], 2026-02-21T11:16:10.6246937Z 'range_multi_buffers': [None, None], 2026-02-21T11:16:10.6247237Z 'range_num_stages': [0, 1], 2026-02-21T11:16:10.6247501Z 'range_unroll_factors': [0, 4], 2026-02-21T11:16:10.6248297Z 'range_warp_specializes': [], 2026-02-21T11:16:10.6248567Z 'waves_per_eu': 2} 2026-02-21T11:16:10.6291005Z [1007s] Fitting surrogate: 1128 points, 
1128 targets 2026-02-21T11:16:10.7655015Z [1007s] Autotuning complete in 1007.1s after searching 1003 configs. 2026-02-21T11:16:10.7655584Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:16:10.7657762Z @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T11:16:10.7659320Z 2026-02-21T11:16:10.7659618Z [1007s] Code of selected kernel: /tmp/torchinductor_root/bl/cbldhhithzna2aj5wivr6b6fmcidpk7j44eu3rdkbssuax7iw2z4.py 2026-02-21T11:16:11.6798235Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T11:16:11.6799154Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T11:16:11.6800015Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T11:16:11.6800247Z WARNING:tritonbench.utils.triton_op:Completed input ID 4: 2026-02-21T11:16:11.6800416Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T11:16:11.6800551Z ------------------------------------------ 2026-02-21T11:16:11.6800696Z (4, 48, 2048, 2048, 128) 2026-02-21T11:16:11.6800763Z 2026-02-21T11:16:11.6808589Z 67%|██████▋ | 4/6 [31:45<19:42, 591.12s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T11:16:11.6808862Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T11:16:11.6809366Z ------------------------------------------ 2026-02-21T11:16:11.6809504Z (4, 48, 4096, 4096, 128) 2026-02-21T11:16:11.6810273Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten 2026-02-21T11:16:12.7003577Z INFO:tritonbench.utils.triton_op:Took 1.67ms to get benchmark function for flex_attention 2026-02-21T11:16:14.2678211Z WARNING:__main__:Input tensor metadata: 2026-02-21T11:16:14.2678508Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T11:16:14.2678725Z 'dtype': 'torch.bfloat16', 2026-02-21T11:16:14.2679263Z 'shape': (4, 48, 4096, 128), 2026-02-21T11:16:14.2679452Z 'stride': (25165824, 524288, 128, 1)}, 2026-02-21T11:16:14.2679638Z { 'device': 'cuda:0', 2026-02-21T11:16:14.2679825Z 'dtype': 'torch.bfloat16', 2026-02-21T11:16:14.2679995Z 'shape': (4, 48, 4096, 128), 2026-02-21T11:16:14.2680182Z 'stride': (25165824, 524288, 128, 1)}, 2026-02-21T11:16:14.2680363Z { 'device': 'cuda:0', 2026-02-21T11:16:14.2680541Z 'dtype': 'torch.bfloat16', 2026-02-21T11:16:14.2680704Z 'shape': (4, 48, 4096, 128), 2026-02-21T11:16:14.2680884Z 'stride': (25165824, 524288, 128, 1)}), 2026-02-21T11:16:14.2681058Z 'kwargs': {}} 2026-02-21T11:16:14.2722647Z INFO:tritonbench.utils.triton_op:Took 4.83ms to get benchmark function for helion_attention 2026-02-21T11:16:14.5166678Z [0s] Autotune random seed: 2144140282 2026-02-21T11:16:15.4864810Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, 
max_generations=20, similarity_penalty=1.0 2026-02-21T11:16:48.7144387Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:16:50.7280141Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:16:56.1670290Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 1, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:16:56.7732820Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:16:57.1268103Z [41s] Timeout after 30s compiling 
Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:16:59.2854632Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:16:59.7674687Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:16:59.9613169Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:17:00.1503382Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 
'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:17:00.4396133Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 8, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:17:01.0962181Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:17:02.3044985Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:17:03.0777761Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 4, 2048], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], 
l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:17:03.6832502Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 2, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:17:04.0959230Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:17:04.3093106Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:17:04.4961889Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'block_ptr', 
'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:17:05.4779296Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:17:05.8860676Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:17:05.8884149Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.3 configs/s 2026-02-21T11:19:18.4715345Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:57:24: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T11:19:18.4716469Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 4096], [524288, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T11:19:18.4717120Z ^ 2026-02-21T11:19:18.4718410Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:59:145: note: - use: %144 = "tt.reshape"(<>) : 
(tensor<1x128x1024xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x1024xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>> 2026-02-21T11:19:18.4719643Z 2026-02-21T11:19:18.4720378Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T11:19:18.4721226Z ^ 2026-02-21T11:19:18.4722145Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T11:19:18.4722704Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}> 2026-02-21T11:19:18.4723309Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}> 2026-02-21T11:19:18.4724011Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}> 2026-02-21T11:19:18.4724597Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}> 2026-02-21T11:19:18.4725182Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:19:18.4725767Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}> 2026-02-21T11:19:18.4726340Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T11:19:18.4726893Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T11:19:18.4727437Z #blocked8 = #ttg.blocked<{sizePerThread = 
[1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}> 2026-02-21T11:19:18.4727972Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}> 2026-02-21T11:19:18.4728600Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T11:19:18.4729179Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}> 2026-02-21T11:19:18.4729769Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}> 2026-02-21T11:19:18.4730312Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}> 2026-02-21T11:19:18.4730849Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:19:18.4731274Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}> 2026-02-21T11:19:18.4731705Z #blocked16 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T11:19:18.4732133Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}> 2026-02-21T11:19:18.4732604Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T11:19:18.4733326Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T11:19:18.4733880Z %c524288_i32 = arith.constant 524288 : i32 2026-02-21T11:19:18.4734069Z %c192_i64 = 
arith.constant 192 : i64 2026-02-21T11:19:18.4734233Z %c0_i64 = arith.constant 0 : i64 2026-02-21T11:19:18.4734406Z %c524288_i64 = arith.constant 524288 : i64 2026-02-21T11:19:18.4734641Z %cst = arith.constant dense<0.000000e+00> : tensor<1x128x1024xbf16, #blocked> 2026-02-21T11:19:18.4734924Z %cst_0 = arith.constant dense<4096> : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:19:18.4735180Z %cst_1 = arith.constant dense<0> : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:19:18.4735428Z %cst_2 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked1> 2026-02-21T11:19:18.4735700Z %cst_3 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked1> 2026-02-21T11:19:18.4735944Z %cst_4 = arith.constant dense<128> : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:19:18.4736205Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked2> 2026-02-21T11:19:18.4736475Z %cst_6 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:19:18.4736778Z %cst_7 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:19:18.4737041Z %cst_8 = arith.constant dense<4096> : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:19:18.4737288Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:19:18.4737531Z %cst_10 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:19:18.4737740Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T11:19:18.4737905Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T11:19:18.4738065Z %c304_i32 = arith.constant 304 : i32 2026-02-21T11:19:18.4738269Z %cst_11 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked4> 2026-02-21T11:19:18.4738519Z %cst_12 = arith.constant dense<128> : tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:19:18.4738790Z %cst_13 = arith.constant dense<0.127517432> : tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4739064Z %cst_14 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4739332Z %cst_15 = arith.constant 
dense<0.000000e+00> : tensor<2x1024xf32, #blocked7> 2026-02-21T11:19:18.4739560Z %c0_i32 = arith.constant 0 : i32 2026-02-21T11:19:18.4739777Z %cst_16 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked2> 2026-02-21T11:19:18.4740073Z %cst_17 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4740314Z %cst_18 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4740488Z %c2_i32 = arith.constant 2 : i32 2026-02-21T11:19:18.4740617Z %c192_i32 = arith.constant 192 : i32 2026-02-21T11:19:18.4740751Z %c393216_i32 = arith.constant 393216 : i32 2026-02-21T11:19:18.4740888Z %0 = tt.get_program_id x : i32 2026-02-21T11:19:18.4741056Z %1 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked8> 2026-02-21T11:19:18.4741284Z %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8> 2026-02-21T11:19:18.4741530Z %3 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:19:18.4741755Z %4 = arith.extsi %1 : tensor<2xi32, #blocked8> to tensor<2xi64, #blocked8> 2026-02-21T11:19:18.4741981Z %5 = arith.extsi %2 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8> 2026-02-21T11:19:18.4742270Z %6 = ttg.convert_layout %5 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:19:18.4742628Z %7 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9> 2026-02-21T11:19:18.4742945Z %8 = ttg.convert_layout %7 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked10> 2026-02-21T11:19:18.4743263Z %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T11:19:18.4743645Z %10 = tt.expand_dims %9 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11> 2026-02-21T11:19:18.4743978Z %11 = 
ttg.convert_layout %10 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked3> 2026-02-21T11:19:18.4744256Z %12 = tt.broadcast %11 : tensor<1x1x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked3> 2026-02-21T11:19:18.4744521Z %13 = ttg.convert_layout %12 : tensor<1x2x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked2> 2026-02-21T11:19:18.4744762Z %14 = arith.cmpi sge, %11, %cst_7 : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:19:18.4744954Z %15 = arith.cmpi slt, %11, %cst_6 : tensor<1x1x128xi64, #blocked3> 2026-02-21T11:19:18.4745148Z %16 = arith.andi %14, %15 : tensor<1x1x128xi1, #blocked3> 2026-02-21T11:19:18.4745361Z %17 = tt.broadcast %16 : tensor<1x1x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked3> 2026-02-21T11:19:18.4745620Z %18 = ttg.convert_layout %17 : tensor<1x2x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked2> 2026-02-21T11:19:18.4745878Z %19 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked8> 2026-02-21T11:19:18.4746117Z %20 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x1024x!tt.ptr, #blocked> 2026-02-21T11:19:18.4746431Z %21 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T11:19:18.4746807Z %22 = tt.expand_dims %21 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi64, #blocked12> 2026-02-21T11:19:18.4747136Z %23 = ttg.convert_layout %22 : tensor<1x128x1xi64, #blocked12> -> tensor<1x128x1xi64, #blocked1> 2026-02-21T11:19:18.4747416Z %24 = tt.broadcast %23 : tensor<1x128x1xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked1> 2026-02-21T11:19:18.4747696Z %25 = ttg.convert_layout %24 : tensor<1x128x1024xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked> 2026-02-21T11:19:18.4747960Z %26 = arith.extsi %19 : tensor<1024xi32, #blocked8> to tensor<1024xi64, #blocked8> 2026-02-21T11:19:18.4748177Z %27 = arith.cmpi sge, %23, %cst_3 : tensor<1x128x1xi64, #blocked1> 
2026-02-21T11:19:18.4748367Z %28 = arith.cmpi slt, %23, %cst_2 : tensor<1x128x1xi64, #blocked1> 2026-02-21T11:19:18.4748551Z %29 = arith.andi %27, %28 : tensor<1x128x1xi1, #blocked1> 2026-02-21T11:19:18.4748830Z %30 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:19:18.4749179Z %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9> 2026-02-21T11:19:18.4749495Z %32 = ttg.convert_layout %31 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10> 2026-02-21T11:19:18.4749805Z %33 = ttg.convert_layout %32 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T11:19:18.4750194Z %34 = tt.expand_dims %33 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11> 2026-02-21T11:19:18.4750529Z %35 = ttg.convert_layout %34 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3> 2026-02-21T11:19:18.4750791Z %36 = tt.broadcast %35 : tensor<1x1x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked3> 2026-02-21T11:19:18.4751064Z %37 = ttg.convert_layout %36 : tensor<1x1024x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:19:18.4751324Z %38 = tt.splat %arg2 : !tt.ptr -> tensor<1x1024x128x!tt.ptr, #blocked13> 2026-02-21T11:19:18.4751601Z %39 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:19:18.4751929Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9> 2026-02-21T11:19:18.4752223Z %41 = ttg.convert_layout %40 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10> 2026-02-21T11:19:18.4752517Z %42 = ttg.convert_layout %41 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = 
#blocked11}>> 2026-02-21T11:19:18.4752864Z %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11> 2026-02-21T11:19:18.4753178Z %44 = ttg.convert_layout %43 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3> 2026-02-21T11:19:18.4753422Z %45 = tt.broadcast %44 : tensor<1x1x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked3> 2026-02-21T11:19:18.4753689Z %46 = ttg.convert_layout %45 : tensor<1x2x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked2> 2026-02-21T11:19:18.4753921Z %47 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:19:18.4754117Z scf.for %arg4 = %0 to %c393216_i32 step %c304_i32 : i32 { 2026-02-21T11:19:18.4754272Z %48 = arith.remsi %arg4, %c192_i32 : i32 2026-02-21T11:19:18.4754402Z %49 = arith.divsi %arg4, %c192_i32 : i32 2026-02-21T11:19:18.4754531Z %50 = arith.muli %49, %c2_i32 : i32 2026-02-21T11:19:18.4754700Z %51 = tt.splat %50 : i32 -> tensor<2xi32, #blocked8> 2026-02-21T11:19:18.4754855Z %52 = arith.addi %51, %1 : tensor<2xi32, #blocked8> 2026-02-21T11:19:18.4754996Z %53 = arith.extsi %48 : i32 to i64 2026-02-21T11:19:18.4755121Z %54 = arith.extsi %50 : i32 to i64 2026-02-21T11:19:18.4755244Z %55 = arith.muli %53, %c524288_i64 : i64 2026-02-21T11:19:18.4755394Z %56 = tt.splat %55 : i64 -> tensor<1x2x128xi64, #blocked2> 2026-02-21T11:19:18.4755553Z %57 = tt.splat %54 : i64 -> tensor<2xi64, #blocked8> 2026-02-21T11:19:18.4755701Z %58 = arith.addi %57, %4 : tensor<2xi64, #blocked8> 2026-02-21T11:19:18.4755937Z %59 = ttg.convert_layout %58 : tensor<2xi64, #blocked8> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:19:18.4756255Z %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x2xi64, #blocked9> 2026-02-21T11:19:18.4756542Z %61 = ttg.convert_layout %60 : tensor<1x2xi64, #blocked9> -> tensor<1x2xi64, #blocked6> 
2026-02-21T11:19:18.4756830Z %62 = ttg.convert_layout %61 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:19:18.4757182Z %63 = tt.expand_dims %62 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi64, #blocked14> 2026-02-21T11:19:18.4757486Z %64 = ttg.convert_layout %63 : tensor<1x2x1xi64, #blocked14> -> tensor<1x2x1xi64, #blocked4> 2026-02-21T11:19:18.4757700Z %65 = arith.muli %64, %cst_10 : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:19:18.4757901Z %66 = tt.broadcast %65 : tensor<1x2x1xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4> 2026-02-21T11:19:18.4758151Z %67 = ttg.convert_layout %66 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked2> 2026-02-21T11:19:18.4758381Z %68 = arith.addi %67, %13 : tensor<1x2x128xi64, #blocked2> 2026-02-21T11:19:18.4758546Z %69 = arith.addi %56, %68 : tensor<1x2x128xi64, #blocked2> 2026-02-21T11:19:18.4758756Z %70 = tt.addptr %3, %69 : tensor<1x2x128x!tt.ptr, #blocked2>, tensor<1x2x128xi64, #blocked2> 2026-02-21T11:19:18.4758957Z %71 = arith.cmpi sge, %53, %c0_i64 : i64 2026-02-21T11:19:18.4759095Z %72 = arith.cmpi slt, %53, %c192_i64 : i64 2026-02-21T11:19:18.4759222Z %73 = arith.andi %71, %72 : i1 2026-02-21T11:19:18.4759368Z %74 = arith.cmpi sge, %64, %cst_9 : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:19:18.4759542Z %75 = arith.cmpi slt, %64, %cst_8 : tensor<1x2x1xi64, #blocked4> 2026-02-21T11:19:18.4759711Z %76 = arith.andi %74, %75 : tensor<1x2x1xi1, #blocked4> 2026-02-21T11:19:18.4759866Z %77 = tt.splat %73 : i1 -> tensor<1x2x1xi1, #blocked4> 2026-02-21T11:19:18.4760024Z %78 = arith.andi %77, %76 : tensor<1x2x1xi1, #blocked4> 2026-02-21T11:19:18.4760213Z %79 = tt.broadcast %78 : tensor<1x2x1xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4> 2026-02-21T11:19:18.4760461Z %80 = ttg.convert_layout %79 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked2> 2026-02-21T11:19:18.4760675Z %81 = arith.andi 
%80, %18 : tensor<1x2x128xi1, #blocked2> 2026-02-21T11:19:18.4760849Z %82 = tt.load %70, %81, %cst_5 : tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:19:18.4761030Z %83 = tt.splat %55 : i64 -> tensor<1x128x1024xi64, #blocked> 2026-02-21T11:19:18.4761189Z %84 = tt.splat %73 : i1 -> tensor<1x128x1xi1, #blocked1> 2026-02-21T11:19:18.4761389Z %85 = arith.andi %84, %29 : tensor<1x128x1xi1, #blocked1> 2026-02-21T11:19:18.4761594Z %86 = tt.broadcast %85 : tensor<1x128x1xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked1> 2026-02-21T11:19:18.4761852Z %87 = ttg.convert_layout %86 : tensor<1x128x1024xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked> 2026-02-21T11:19:18.4762109Z %88 = tt.reshape %82 : tensor<1x2x128xbf16, #blocked2> -> tensor<2x128xbf16, #blocked10> 2026-02-21T11:19:18.4762293Z %89 = arith.muli %48, %c524288_i32 : i32 2026-02-21T11:19:18.4762458Z %90 = tt.splat %89 : i32 -> tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:19:18.4762880Z %91:3 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c1024_i32 iter_args(%arg6 = %cst_18, %arg7 = %cst_17, %arg8 = %cst_16) -> (tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2>) : i32 { 2026-02-21T11:19:18.4763242Z %113 = tt.splat %arg5 : i32 -> tensor<1024xi32, #blocked8> 2026-02-21T11:19:18.4763413Z %114 = arith.addi %113, %19 : tensor<1024xi32, #blocked8> 2026-02-21T11:19:18.4763558Z %115 = arith.extsi %arg5 : i32 to i64 2026-02-21T11:19:18.4763709Z %116 = tt.splat %115 : i64 -> tensor<1024xi64, #blocked8> 2026-02-21T11:19:18.4763869Z %117 = arith.addi %116, %26 : tensor<1024xi64, #blocked8> 2026-02-21T11:19:18.4764127Z %118 = ttg.convert_layout %117 : tensor<1024xi64, #blocked8> -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:19:18.4764484Z %119 = tt.expand_dims %118 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi64, #blocked9> 2026-02-21T11:19:18.4764817Z %120 = ttg.convert_layout %119 : tensor<1x1024xi64, 
#blocked9> -> tensor<1x1024xi64, #blocked7> 2026-02-21T11:19:18.4765124Z %121 = ttg.convert_layout %120 : tensor<1x1024xi64, #blocked7> -> tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>> 2026-02-21T11:19:18.4765487Z %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>> -> tensor<1x1x1024xi64, #blocked15> 2026-02-21T11:19:18.4765811Z %123 = ttg.convert_layout %122 : tensor<1x1x1024xi64, #blocked15> -> tensor<1x1x1024xi64, #blocked> 2026-02-21T11:19:18.4766044Z %124 = arith.muli %123, %cst_4 : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:19:18.4766282Z %125 = tt.broadcast %124 : tensor<1x1x1024xi64, #blocked> -> tensor<1x128x1024xi64, #blocked> 2026-02-21T11:19:18.4766501Z %126 = arith.addi %25, %125 : tensor<1x128x1024xi64, #blocked> 2026-02-21T11:19:18.4766674Z %127 = arith.addi %83, %126 : tensor<1x128x1024xi64, #blocked> 2026-02-21T11:19:18.4766896Z %128 = tt.addptr %20, %127 : tensor<1x128x1024x!tt.ptr, #blocked>, tensor<1x128x1024xi64, #blocked> 2026-02-21T11:19:18.4767133Z %129 = arith.cmpi sge, %123, %cst_1 : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:19:18.4767316Z %130 = arith.cmpi slt, %123, %cst_0 : tensor<1x1x1024xi64, #blocked> 2026-02-21T11:19:18.4767499Z %131 = arith.andi %129, %130 : tensor<1x1x1024xi1, #blocked> 2026-02-21T11:19:18.4767705Z %132 = tt.broadcast %131 : tensor<1x1x1024xi1, #blocked> -> tensor<1x128x1024xi1, #blocked> 2026-02-21T11:19:18.4767921Z %133 = arith.andi %87, %132 : tensor<1x128x1024xi1, #blocked> 2026-02-21T11:19:18.4768108Z %134 = tt.load %128, %133, %cst : tensor<1x128x1024x!tt.ptr, #blocked> 2026-02-21T11:19:18.4768335Z %135 = tt.reshape %134 : tensor<1x128x1024xbf16, #blocked> -> tensor<128x1024xbf16, #blocked7> 2026-02-21T11:19:18.4768654Z %136 = ttg.convert_layout %88 : tensor<2x128xbf16, #blocked10> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> 2026-02-21T11:19:18.4769025Z %137 = ttg.convert_layout %135 : 
tensor<128x1024xbf16, #blocked7> -> tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> 2026-02-21T11:19:18.4769353Z %138 = ttg.convert_layout %cst_15 : tensor<2x1024xf32, #blocked7> -> tensor<2x1024xf32, #blocked16> 2026-02-21T11:19:18.4769810Z %139 = tt.dot %136, %137, %138, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> * tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> -> tensor<2x1024xf32, #blocked16> 2026-02-21T11:19:18.4770231Z %140 = ttg.convert_layout %139 : tensor<2x1024xf32, #blocked16> -> tensor<2x1024xf32, #blocked7> 2026-02-21T11:19:18.4770485Z %141 = tt.reshape %140 : tensor<2x1024xf32, #blocked7> -> tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4770752Z %142 = arith.truncf %141 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked> 2026-02-21T11:19:18.4770998Z %143 = arith.extf %142 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4771196Z %144 = "tt.reduce"(%143) <{axis = 2 : i32}> ({ 2026-02-21T11:19:18.4771328Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T11:19:18.4771457Z %199 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T11:19:18.4771588Z tt.reduce.return %199 : f32 2026-02-21T11:19:18.4771785Z }) : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:19:18.4772083Z %145 = ttg.convert_layout %144 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4772359Z %146 = arith.truncf %145 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6> 2026-02-21T11:19:18.4772621Z %147 = arith.extf %146 : tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4772824Z %148 = arith.mulf %147, %cst_14 : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4773049Z %149 = arith.truncf %148 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6> 2026-02-21T11:19:18.4773275Z %150 = arith.extf %149 
: tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4773474Z %151 = arith.cmpf ogt, %arg6, %150 : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4773653Z %152 = arith.cmpf une, %arg6, %arg6 : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4773829Z %153 = arith.ori %151, %152 : tensor<1x2xi1, #blocked6> 2026-02-21T11:19:18.4774030Z %154 = arith.select %153, %arg6, %150 : tensor<1x2xi1, #blocked6>, tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4774265Z %155 = arith.mulf %143, %cst_13 : tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4774477Z %156 = arith.truncf %155 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked> 2026-02-21T11:19:18.4774779Z %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:19:18.4775130Z %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14> 2026-02-21T11:19:18.4775436Z %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T11:19:18.4775692Z %160 = arith.extf %156 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4775938Z %161 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked4> 2026-02-21T11:19:18.4776200Z %162 = ttg.convert_layout %161 : tensor<1x2x1024xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4776425Z %163 = arith.subf %160, %162 : tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4776741Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2x1024xf32, #blocked> 2026-02-21T11:19:18.4777043Z %165 = "tt.reduce"(%164) <{axis = 2 : i32}> ({ 2026-02-21T11:19:18.4777176Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T11:19:18.4777304Z %199 = arith.addf %arg9, %arg10 : f32 
2026-02-21T11:19:18.4777451Z tt.reduce.return %199 : f32 2026-02-21T11:19:18.4777640Z }) : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:19:18.4777936Z %166 = ttg.convert_layout %165 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4778182Z %167 = arith.subf %arg6, %154 : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4778475Z %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked6>) -> tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4778788Z %169 = arith.mulf %arg7, %168 : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4778950Z %170 = arith.addf %169, %166 : tensor<1x2xf32, #blocked6> 2026-02-21T11:19:18.4779202Z %171 = ttg.convert_layout %168 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:19:18.4779544Z %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14> 2026-02-21T11:19:18.4779856Z %173 = ttg.convert_layout %172 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T11:19:18.4780108Z %174 = tt.broadcast %173 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4> 2026-02-21T11:19:18.4780365Z %175 = ttg.convert_layout %174 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2> 2026-02-21T11:19:18.4780591Z %176 = arith.mulf %arg8, %175 : tensor<1x2x128xf32, #blocked2> 2026-02-21T11:19:18.4780861Z %177 = ttg.convert_layout %114 : tensor<1024xi32, #blocked8> -> tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:19:18.4781208Z %178 = tt.expand_dims %177 {axis = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi32, #blocked9> 2026-02-21T11:19:18.4781519Z %179 = ttg.convert_layout %178 : tensor<1x1024xi32, #blocked9> -> 
tensor<1x1024xi32, #blocked7> 2026-02-21T11:19:18.4781820Z %180 = ttg.convert_layout %179 : tensor<1x1024xi32, #blocked7> -> tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> 2026-02-21T11:19:18.4782202Z %181 = tt.expand_dims %180 {axis = 2 : i32} : tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x1024x1xi32, #blocked17> 2026-02-21T11:19:18.4782530Z %182 = ttg.convert_layout %181 : tensor<1x1024x1xi32, #blocked17> -> tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:19:18.4782761Z %183 = arith.muli %182, %cst_12 : tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:19:18.4782940Z %184 = arith.addi %90, %183 : tensor<1x1024x1xi32, #blocked5> 2026-02-21T11:19:18.4783153Z %185 = tt.broadcast %184 : tensor<1x1024x1xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked5> 2026-02-21T11:19:18.4783430Z %186 = ttg.convert_layout %185 : tensor<1x1024x128xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:19:18.4783660Z %187 = arith.addi %186, %37 : tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:19:18.4783896Z %188 = tt.addptr %38, %187 : tensor<1x1024x128x!tt.ptr, #blocked13>, tensor<1x1024x128xi32, #blocked13> 2026-02-21T11:19:18.4784137Z %189 = tt.load %188 : tensor<1x1024x128x!tt.ptr, #blocked13> 2026-02-21T11:19:18.4784352Z %190 = arith.truncf %164 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked> 2026-02-21T11:19:18.4784603Z %191 = tt.reshape %176 : tensor<1x2x128xf32, #blocked2> -> tensor<2x128xf32, #blocked10> 2026-02-21T11:19:18.4784844Z %192 = tt.reshape %190 : tensor<1x2x1024xbf16, #blocked> -> tensor<2x1024xbf16, #blocked7> 2026-02-21T11:19:18.4785105Z %193 = tt.reshape %189 : tensor<1x1024x128xbf16, #blocked13> -> tensor<1024x128xbf16, #blocked10> 2026-02-21T11:19:18.4785425Z %194 = ttg.convert_layout %192 : tensor<2x1024xbf16, #blocked7> -> tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> 2026-02-21T11:19:18.4785829Z %195 = ttg.convert_layout %193 : tensor<1024x128xbf16, #blocked10> 
-> tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> 2026-02-21T11:19:18.4786148Z %196 = ttg.convert_layout %191 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked10> 2026-02-21T11:19:18.4786574Z %197 = tt.dot %194, %195, %196, inputPrecision = tf32 : tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10> 2026-02-21T11:19:18.4787001Z %198 = tt.reshape %197 : tensor<2x128xf32, #blocked10> -> tensor<1x2x128xf32, #blocked2> 2026-02-21T11:19:18.4787279Z scf.yield %154, %170, %198 : tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2> 2026-02-21T11:19:18.4787489Z } 2026-02-21T11:19:18.4787688Z %92 = ttg.convert_layout %91#1 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:19:18.4788035Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14> 2026-02-21T11:19:18.4788333Z %94 = ttg.convert_layout %93 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4> 2026-02-21T11:19:18.4788580Z %95 = tt.broadcast %94 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4> 2026-02-21T11:19:18.4788822Z %96 = ttg.convert_layout %95 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2> 2026-02-21T11:19:18.4789041Z %97 = arith.divf %91#2, %96 : tensor<1x2x128xf32, #blocked2> 2026-02-21T11:19:18.4789263Z %98 = arith.truncf %97 : tensor<1x2x128xf32, #blocked2> to tensor<1x2x128xbf16, #blocked2> 2026-02-21T11:19:18.4789456Z %99 = arith.muli %48, %c524288_i32 : i32 2026-02-21T11:19:18.4789686Z %100 = ttg.convert_layout %52 : tensor<2xi32, #blocked8> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:19:18.4790014Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> 
tensor<1x2xi32, #blocked9> 2026-02-21T11:19:18.4790309Z %102 = ttg.convert_layout %101 : tensor<1x2xi32, #blocked9> -> tensor<1x2xi32, #blocked6> 2026-02-21T11:19:18.4790621Z %103 = ttg.convert_layout %102 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T11:19:18.4790961Z %104 = tt.expand_dims %103 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi32, #blocked14> 2026-02-21T11:19:18.4791275Z %105 = ttg.convert_layout %104 : tensor<1x2x1xi32, #blocked14> -> tensor<1x2x1xi32, #blocked4> 2026-02-21T11:19:18.4791491Z %106 = arith.muli %105, %cst_11 : tensor<1x2x1xi32, #blocked4> 2026-02-21T11:19:18.4791665Z %107 = tt.splat %99 : i32 -> tensor<1x2x1xi32, #blocked4> 2026-02-21T11:19:18.4791835Z %108 = arith.addi %107, %106 : tensor<1x2x1xi32, #blocked4> 2026-02-21T11:19:18.4792041Z %109 = tt.broadcast %108 : tensor<1x2x1xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4> 2026-02-21T11:19:18.4792294Z %110 = ttg.convert_layout %109 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked2> 2026-02-21T11:19:18.4792510Z %111 = arith.addi %110, %46 : tensor<1x2x128xi32, #blocked2> 2026-02-21T11:19:18.4792731Z %112 = tt.addptr %47, %111 : tensor<1x2x128x!tt.ptr, #blocked2>, tensor<1x2x128xi32, #blocked2> 2026-02-21T11:19:18.4792958Z tt.store %112, %98 : tensor<1x2x128x!tt.ptr, #blocked2> 2026-02-21T11:19:18.4793108Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T11:19:18.4793228Z tt.return 2026-02-21T11:19:18.4793317Z } 2026-02-21T11:19:18.4793401Z } 2026-02-21T11:19:18.4793447Z 2026-02-21T11:19:18.4793480Z {-# 2026-02-21T11:19:18.4793568Z external_resources: { 2026-02-21T11:19:18.4793675Z mlir_reproducer: { 2026-02-21T11:19:18.4795933Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, 
tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T11:19:18.4798296Z disable_threading: false, 2026-02-21T11:19:18.4798406Z verify_each: true 2026-02-21T11:19:18.4798508Z } 2026-02-21T11:19:18.4798582Z } 2026-02-21T11:19:18.4798659Z #-} 2026-02-21T11:19:18.4798939Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T11:19:18.4799693Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T11:19:18.4800255Z [182s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T11:19:18.4801077Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 0], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T11:19:18.4801823Z Error: RuntimeError: PassManager::run failed 2026-02-21T11:19:18.4802002Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T11:21:15.9693968Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:55:130: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T11:21:15.9695144Z k = tl.load(k_view + (indices_0[:, None, None] * 524288 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T11:21:15.9695826Z ^ 2026-02-21T11:21:15.9697617Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:57:141: note: - use: %132 = "tt.reshape"(<>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>> 2026-02-21T11:21:15.9699170Z 2026-02-21T11:21:15.9700064Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T11:21:15.9701782Z ^ 2026-02-21T11:21:15.9702255Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T11:21:15.9702903Z #blocked = #ttg.blocked<{sizePerThread = [1, 
1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:21:15.9703733Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:21:15.9704714Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:21:15.9705368Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T11:21:15.9705915Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T11:21:15.9706429Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T11:21:15.9706944Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T11:21:15.9707499Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:21:15.9708073Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:21:15.9708718Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:21:15.9709268Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T11:21:15.9709809Z #blocked11 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T11:21:15.9710407Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T11:21:15.9711494Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: 
!tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T11:21:15.9712218Z %c524288_i32 = arith.constant 524288 : i32 2026-02-21T11:21:15.9712454Z %c192_i64 = arith.constant 192 : i64 2026-02-21T11:21:15.9712668Z %c0_i64 = arith.constant 0 : i64 2026-02-21T11:21:15.9712892Z %c524288_i64 = arith.constant 524288 : i64 2026-02-21T11:21:15.9713191Z %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked> 2026-02-21T11:21:15.9713551Z %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked> 2026-02-21T11:21:15.9713880Z %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked> 2026-02-21T11:21:15.9714205Z %cst_2 = arith.constant dense<4096> : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:21:15.9714529Z %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:21:15.9714827Z %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:21:15.9715039Z %c512_i32 = arith.constant 512 : i32 2026-02-21T11:21:15.9715206Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T11:21:15.9715366Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T11:21:15.9715566Z %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1> 2026-02-21T11:21:15.9715806Z %cst_6 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked2> 2026-02-21T11:21:15.9716061Z %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x512xf32, #blocked> 2026-02-21T11:21:15.9716323Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9716606Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x512xf32, #blocked4> 2026-02-21T11:21:15.9716863Z %cst_10 = arith.constant dense<128> : tensor<1x1x512xi32, #blocked> 2026-02-21T11:21:15.9717060Z %c0_i32 = arith.constant 0 : i32 2026-02-21T11:21:15.9717265Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked> 
2026-02-21T11:21:15.9717529Z %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9717788Z %cst_13 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9718017Z %c2_i32 = arith.constant 2 : i32 2026-02-21T11:21:15.9718172Z %c16_i32 = arith.constant 16 : i32 2026-02-21T11:21:15.9718331Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T11:21:15.9718486Z %0 = tt.get_program_id x : i32 2026-02-21T11:21:15.9718642Z %1 = arith.divsi %0, %c3072_i32 : i32 2026-02-21T11:21:15.9718801Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T11:21:15.9718953Z %3 = arith.subi %c2048_i32, %2 : i32 2026-02-21T11:21:15.9719104Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T11:21:15.9719258Z %5 = arith.remsi %0, %c3072_i32 : i32 2026-02-21T11:21:15.9719408Z %6 = arith.remsi %5, %4 : i32 2026-02-21T11:21:15.9719558Z %7 = arith.addi %2, %6 : i32 2026-02-21T11:21:15.9719704Z %8 = arith.divsi %5, %4 : i32 2026-02-21T11:21:15.9719850Z %9 = arith.muli %7, %c2_i32 : i32 2026-02-21T11:21:15.9720061Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5> 2026-02-21T11:21:15.9720304Z %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5> 2026-02-21T11:21:15.9720505Z %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5> 2026-02-21T11:21:15.9720760Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5> 2026-02-21T11:21:15.9720986Z %14 = arith.extsi %8 : i32 to i64 2026-02-21T11:21:15.9721140Z %15 = arith.extsi %9 : i32 to i64 2026-02-21T11:21:15.9721350Z %16 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T11:21:15.9721579Z %17 = arith.muli %14, %c524288_i64 : i64 2026-02-21T11:21:15.9721767Z %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked> 2026-02-21T11:21:15.9721972Z %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5> 2026-02-21T11:21:15.9722222Z %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5> 
2026-02-21T11:21:15.9722456Z %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5> 2026-02-21T11:21:15.9722859Z %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:21:15.9723286Z %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6> 2026-02-21T11:21:15.9723664Z %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3> 2026-02-21T11:21:15.9724031Z %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:21:15.9724474Z %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7> 2026-02-21T11:21:15.9724865Z %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1> 2026-02-21T11:21:15.9725143Z %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:21:15.9725408Z %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1> 2026-02-21T11:21:15.9725699Z %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked> 2026-02-21T11:21:15.9725950Z %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5> 2026-02-21T11:21:15.9726234Z %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:21:15.9726576Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6> 2026-02-21T11:21:15.9726902Z %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4> 2026-02-21T11:21:15.9727201Z %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:21:15.9727563Z %36 = 
tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8> 2026-02-21T11:21:15.9727902Z %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked> 2026-02-21T11:21:15.9728156Z %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked> 2026-02-21T11:21:15.9728366Z %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked> 2026-02-21T11:21:15.9728532Z %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked> 2026-02-21T11:21:15.9728749Z %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi64, #blocked> 2026-02-21T11:21:15.9728953Z %42 = arith.cmpi sge, %14, %c0_i64 : i64 2026-02-21T11:21:15.9729087Z %43 = arith.cmpi slt, %14, %c192_i64 : i64 2026-02-21T11:21:15.9729219Z %44 = arith.andi %42, %43 : i1 2026-02-21T11:21:15.9729369Z %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:21:15.9729557Z %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1> 2026-02-21T11:21:15.9729730Z %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1> 2026-02-21T11:21:15.9729896Z %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1> 2026-02-21T11:21:15.9730077Z %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1> 2026-02-21T11:21:15.9730272Z %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1> 2026-02-21T11:21:15.9730525Z %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked> 2026-02-21T11:21:15.9730749Z %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked> 2026-02-21T11:21:15.9730935Z %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked> 2026-02-21T11:21:15.9731107Z %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked> 2026-02-21T11:21:15.9731329Z %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked> 2026-02-21T11:21:15.9731533Z %56 = arith.andi %51, %55 : 
tensor<1x2x128xi1, #blocked> 2026-02-21T11:21:15.9731710Z %57 = tt.load %41, %56, %cst : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T11:21:15.9731915Z %58 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked5> 2026-02-21T11:21:15.9732095Z %59 = arith.muli %8, %c524288_i32 : i32 2026-02-21T11:21:15.9732330Z %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:21:15.9732675Z %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T11:21:15.9732978Z %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4> 2026-02-21T11:21:15.9733282Z %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T11:21:15.9733638Z %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9> 2026-02-21T11:21:15.9733957Z %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2> 2026-02-21T11:21:15.9734182Z %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2> 2026-02-21T11:21:15.9734348Z %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2> 2026-02-21T11:21:15.9734561Z %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x512xi32, #blocked2> 2026-02-21T11:21:15.9734850Z %69 = ttg.convert_layout %68 : tensor<1x128x512xi32, #blocked2> -> tensor<1x128x512xi32, #blocked> 2026-02-21T11:21:15.9735091Z %70 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x512x!tt.ptr, #blocked> 2026-02-21T11:21:15.9735313Z %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4> 2026-02-21T11:21:15.9735508Z %72 = tt.splat %59 : i32 -> tensor<1x512x1xi32, #blocked2> 2026-02-21T11:21:15.9735752Z %73 = ttg.convert_layout %62 : tensor<1x128xi32, 
#blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:21:15.9736108Z %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8> 2026-02-21T11:21:15.9736410Z %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T11:21:15.9736656Z %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked> 2026-02-21T11:21:15.9736882Z %77 = tt.splat %arg2 : !tt.ptr -> tensor<1x512x128x!tt.ptr, #blocked> 2026-02-21T11:21:15.9737278Z %78:3 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>) : i32 { 2026-02-21T11:21:15.9737643Z %108 = tt.splat %arg4 : i32 -> tensor<512xi32, #blocked5> 2026-02-21T11:21:15.9737819Z %109 = arith.addi %108, %58 : tensor<512xi32, #blocked5> 2026-02-21T11:21:15.9738065Z %110 = ttg.convert_layout %109 : tensor<512xi32, #blocked5> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:21:15.9738427Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6> 2026-02-21T11:21:15.9738725Z %112 = ttg.convert_layout %111 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked4> 2026-02-21T11:21:15.9739022Z %113 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:21:15.9739371Z %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x512xi32, #blocked8> 2026-02-21T11:21:15.9739702Z %115 = ttg.convert_layout %114 : tensor<1x1x512xi32, #blocked8> -> tensor<1x1x512xi32, #blocked> 2026-02-21T11:21:15.9739917Z %116 = arith.muli %115, %cst_10 : tensor<1x1x512xi32, #blocked> 
2026-02-21T11:21:15.9740136Z %117 = tt.broadcast %116 : tensor<1x1x512xi32, #blocked> -> tensor<1x128x512xi32, #blocked> 2026-02-21T11:21:15.9740345Z %118 = arith.addi %69, %117 : tensor<1x128x512xi32, #blocked> 2026-02-21T11:21:15.9740563Z %119 = tt.addptr %70, %118 : tensor<1x128x512x!tt.ptr, #blocked>, tensor<1x128x512xi32, #blocked> 2026-02-21T11:21:15.9740788Z %120 = tt.load %119 : tensor<1x128x512x!tt.ptr, #blocked> 2026-02-21T11:21:15.9740995Z %121 = tt.reshape %120 : tensor<1x128x512xbf16, #blocked> -> tensor<128x512xbf16, #blocked4> 2026-02-21T11:21:15.9741298Z %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> 2026-02-21T11:21:15.9741655Z %123 = ttg.convert_layout %121 : tensor<128x512xbf16, #blocked4> -> tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> 2026-02-21T11:21:15.9741968Z %124 = ttg.convert_layout %cst_9 : tensor<2x512xf32, #blocked4> -> tensor<2x512xf32, #blocked10> 2026-02-21T11:21:15.9742390Z %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x512xf32, #blocked10> 2026-02-21T11:21:15.9742814Z %126 = ttg.convert_layout %125 : tensor<2x512xf32, #blocked10> -> tensor<2x512xf32, #blocked4> 2026-02-21T11:21:15.9743051Z %127 = tt.reshape %126 : tensor<2x512xf32, #blocked4> -> tensor<1x2x512xf32, #blocked> 2026-02-21T11:21:15.9743310Z %128 = arith.truncf %127 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked> 2026-02-21T11:21:15.9743543Z %129 = arith.extf %128 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked> 2026-02-21T11:21:15.9743730Z %130 = "tt.reduce"(%129) <{axis = 2 : i32}> ({ 2026-02-21T11:21:15.9743858Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T11:21:15.9743980Z %183 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T11:21:15.9744125Z tt.reduce.return %183 : f32 
2026-02-21T11:21:15.9744312Z }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:21:15.9744604Z %131 = ttg.convert_layout %130 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9744873Z %132 = arith.truncf %131 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 2026-02-21T11:21:15.9745097Z %133 = arith.extf %132 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9745314Z %134 = arith.mulf %133, %cst_8 : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9745504Z %135 = arith.truncf %134 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3> 2026-02-21T11:21:15.9745722Z %136 = arith.extf %135 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9745913Z %137 = arith.cmpf ogt, %arg5, %136 : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9746090Z %138 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9746255Z %139 = arith.ori %137, %138 : tensor<1x2xi1, #blocked3> 2026-02-21T11:21:15.9746472Z %140 = arith.select %139, %arg5, %136 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9746680Z %141 = arith.mulf %129, %cst_7 : tensor<1x2x512xf32, #blocked> 2026-02-21T11:21:15.9746880Z %142 = arith.truncf %141 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked> 2026-02-21T11:21:15.9747167Z %143 = ttg.convert_layout %140 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:21:15.9747504Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T11:21:15.9747818Z %145 = ttg.convert_layout %144 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T11:21:15.9748062Z %146 = arith.extf %142 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked> 
2026-02-21T11:21:15.9748295Z %147 = tt.broadcast %145 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x512xf32, #blocked1> 2026-02-21T11:21:15.9748538Z %148 = ttg.convert_layout %147 : tensor<1x2x512xf32, #blocked1> -> tensor<1x2x512xf32, #blocked> 2026-02-21T11:21:15.9748749Z %149 = arith.subf %146, %148 : tensor<1x2x512xf32, #blocked> 2026-02-21T11:21:15.9749051Z %150 = tt.extern_elementwise %149 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2x512xf32, #blocked> 2026-02-21T11:21:15.9749348Z %151 = "tt.reduce"(%150) <{axis = 2 : i32}> ({ 2026-02-21T11:21:15.9749472Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T11:21:15.9749590Z %183 = arith.addf %arg8, %arg9 : f32 2026-02-21T11:21:15.9749708Z tt.reduce.return %183 : f32 2026-02-21T11:21:15.9749889Z }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> 2026-02-21T11:21:15.9750178Z %152 = ttg.convert_layout %151 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9750420Z %153 = arith.subf %arg5, %140 : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9750706Z %154 = tt.extern_elementwise %153 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9751006Z %155 = arith.mulf %arg6, %154 : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9751164Z %156 = arith.addf %155, %152 : tensor<1x2xf32, #blocked3> 2026-02-21T11:21:15.9751403Z %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:21:15.9751734Z %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T11:21:15.9752047Z %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T11:21:15.9752285Z 
%160 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T11:21:15.9752527Z %161 = ttg.convert_layout %160 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T11:21:15.9752738Z %162 = arith.mulf %arg7, %161 : tensor<1x2x128xf32, #blocked> 2026-02-21T11:21:15.9752984Z %163 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T11:21:15.9753326Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x512x1xi32, #blocked9> 2026-02-21T11:21:15.9753629Z %165 = ttg.convert_layout %164 : tensor<1x512x1xi32, #blocked9> -> tensor<1x512x1xi32, #blocked2> 2026-02-21T11:21:15.9753847Z %166 = arith.muli %165, %cst_6 : tensor<1x512x1xi32, #blocked2> 2026-02-21T11:21:15.9754016Z %167 = arith.addi %72, %166 : tensor<1x512x1xi32, #blocked2> 2026-02-21T11:21:15.9754230Z %168 = tt.broadcast %167 : tensor<1x512x1xi32, #blocked2> -> tensor<1x512x128xi32, #blocked2> 2026-02-21T11:21:15.9754488Z %169 = ttg.convert_layout %168 : tensor<1x512x128xi32, #blocked2> -> tensor<1x512x128xi32, #blocked> 2026-02-21T11:21:15.9754702Z %170 = arith.addi %169, %76 : tensor<1x512x128xi32, #blocked> 2026-02-21T11:21:15.9754917Z %171 = tt.addptr %77, %170 : tensor<1x512x128x!tt.ptr, #blocked>, tensor<1x512x128xi32, #blocked> 2026-02-21T11:21:15.9755133Z %172 = tt.load %171 : tensor<1x512x128x!tt.ptr, #blocked> 2026-02-21T11:21:15.9755338Z %173 = arith.truncf %150 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked> 2026-02-21T11:21:15.9755592Z %174 = tt.reshape %162 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4> 2026-02-21T11:21:15.9755819Z %175 = tt.reshape %173 : tensor<1x2x512xbf16, #blocked> -> tensor<2x512xbf16, #blocked4> 2026-02-21T11:21:15.9756056Z %176 = tt.reshape %172 : tensor<1x512x128xbf16, #blocked> -> tensor<512x128xbf16, #blocked4> 
2026-02-21T11:21:15.9756355Z %177 = ttg.convert_layout %175 : tensor<2x512xbf16, #blocked4> -> tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>> 2026-02-21T11:21:15.9756709Z %178 = ttg.convert_layout %176 : tensor<512x128xbf16, #blocked4> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>> 2026-02-21T11:21:15.9757015Z %179 = ttg.convert_layout %174 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked11> 2026-02-21T11:21:15.9757425Z %180 = tt.dot %177, %178, %179, inputPrecision = tf32 : tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>> -> tensor<2x128xf32, #blocked11> 2026-02-21T11:21:15.9757833Z %181 = ttg.convert_layout %180 : tensor<2x128xf32, #blocked11> -> tensor<2x128xf32, #blocked4> 2026-02-21T11:21:15.9758071Z %182 = tt.reshape %181 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked> 2026-02-21T11:21:15.9758335Z scf.yield %140, %156, %182 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked> 2026-02-21T11:21:15.9758587Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32} 2026-02-21T11:21:15.9758853Z %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:21:15.9759197Z %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7> 2026-02-21T11:21:15.9759489Z %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1> 2026-02-21T11:21:15.9759723Z %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1> 2026-02-21T11:21:15.9759962Z %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked> 2026-02-21T11:21:15.9760186Z %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked> 2026-02-21T11:21:15.9760378Z %85 = 
arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked> 2026-02-21T11:21:15.9760558Z %86 = arith.muli %8, %c524288_i32 : i32 2026-02-21T11:21:15.9760770Z %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:21:15.9761083Z %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6> 2026-02-21T11:21:15.9761364Z %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3> 2026-02-21T11:21:15.9761636Z %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T11:21:15.9761960Z %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7> 2026-02-21T11:21:15.9762266Z %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1> 2026-02-21T11:21:15.9762469Z %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1> 2026-02-21T11:21:15.9762668Z %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1> 2026-02-21T11:21:15.9762818Z %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1> 2026-02-21T11:21:15.9763052Z %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> 2026-02-21T11:21:15.9763370Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6> 2026-02-21T11:21:15.9763675Z %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4> 2026-02-21T11:21:15.9763959Z %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> 2026-02-21T11:21:15.9764295Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> 
tensor<1x1x128xi32, #blocked8> 2026-02-21T11:21:15.9764595Z %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked> 2026-02-21T11:21:15.9764836Z %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1> 2026-02-21T11:21:15.9765079Z %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked> 2026-02-21T11:21:15.9765323Z %104 = tt.broadcast %101 : tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked> 2026-02-21T11:21:15.9765519Z %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked> 2026-02-21T11:21:15.9765703Z %106 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T11:21:15.9765936Z %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr, #blocked>, tensor<1x2x128xi32, #blocked> 2026-02-21T11:21:15.9766150Z tt.store %107, %85 : tensor<1x2x128x!tt.ptr, #blocked> 2026-02-21T11:21:15.9766282Z tt.return 2026-02-21T11:21:15.9766366Z } 2026-02-21T11:21:15.9766441Z } 2026-02-21T11:21:15.9766483Z 2026-02-21T11:21:15.9766513Z {-# 2026-02-21T11:21:15.9766597Z external_resources: { 2026-02-21T11:21:15.9766695Z mlir_reproducer: { 2026-02-21T11:21:15.9768958Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T11:21:15.9782368Z disable_threading: false, 2026-02-21T11:21:15.9782479Z verify_each: true 2026-02-21T11:21:15.9782568Z } 2026-02-21T11:21:15.9782644Z } 2026-02-21T11:21:15.9782716Z #-} 2026-02-21T11:21:15.9782990Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T11:21:15.9783719Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T11:21:15.9784272Z [300s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T11:21:15.9785024Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 512], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T11:21:15.9785688Z Error: RuntimeError: PassManager::run failed 2026-02-21T11:21:15.9785855Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T11:24:02.1995452Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 1.1 configs/s 2026-02-21T11:24:02.2005745Z [466s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s]) 2026-02-21T11:24:02.2377803Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 - configs/s 2026-02-21T11:24:02.9888146Z [467s] Initial random population of 100, 5 starting points: 2026-02-21T11:24:02.9888585Z error=23 2026-02-21T11:24:02.9888806Z timeout=19 2026-02-21T11:24:02.9889005Z ok=58 2026-02-21T11:24:02.9889204Z min=8.1410 2026-02-21T11:24:02.9889403Z mid=59.0059 2026-02-21T11:24:02.9889632Z max=12096.0410 2026-02-21T11:24:02.9889878Z best={'block_sizes': [1, 256, 16], 2026-02-21T11:24:02.9890278Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:24:02.9890696Z 'l2_groupings': [8], 2026-02-21T11:24:02.9890966Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:24:02.9891306Z 'loop_orders': [[0, 1]], 2026-02-21T11:24:02.9891589Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:24:02.9891899Z 'num_sm_multiplier': 16, 2026-02-21T11:24:02.9892151Z 'num_stages': 4, 2026-02-21T11:24:02.9892379Z 'num_warps': 16, 2026-02-21T11:24:02.9892625Z 'pid_type': 'persistent_blocked', 2026-02-21T11:24:02.9893210Z 'range_flattens': [False, True], 2026-02-21T11:24:02.9893514Z 
'range_multi_buffers': [None, False], 2026-02-21T11:24:02.9893821Z 'range_num_stages': [1, 3], 2026-02-21T11:24:02.9894098Z 'range_unroll_factors': [2, 3], 2026-02-21T11:24:02.9894393Z 'range_warp_specializes': [], 2026-02-21T11:24:02.9894631Z 'waves_per_eu': 1} 2026-02-21T11:24:02.9903256Z [467s] Fitting surrogate: 100 points, 100 targets 2026-02-21T11:24:03.8537636Z [468s] Generation 1 starting: 84 neighbors, 5 active search path(s) 2026-02-21T11:24:38.8742162Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 0.5 configs/s 2026-02-21T11:24:59.1466955Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 86/86 4.3 configs/s 2026-02-21T11:24:59.3677996Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 27/27 - configs/s 2026-02-21T11:25:04.5778644Z [529s] Generation 1 complete: 2026-02-21T11:25:04.5781669Z ok=89 2026-02-21T11:25:04.5781982Z min=7.1705 2026-02-21T11:25:04.5782217Z mid=14.2522 2026-02-21T11:25:04.5782371Z max=192.5479 2026-02-21T11:25:04.5782588Z best={'block_sizes': [1, 256, 32], 2026-02-21T11:25:04.5782925Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:25:04.5783240Z 'l2_groupings': [8], 2026-02-21T11:25:04.5783453Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:25:04.5783694Z 'loop_orders': [[0, 1]], 2026-02-21T11:25:04.5783918Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:25:04.5784129Z 'num_sm_multiplier': 8, 2026-02-21T11:25:04.5784341Z 'num_stages': 4, 2026-02-21T11:25:04.5784516Z 'num_warps': 8, 2026-02-21T11:25:04.5784714Z 'pid_type': 'persistent_blocked', 2026-02-21T11:25:04.5784945Z 'range_flattens': [None, True], 2026-02-21T11:25:04.5785548Z 'range_multi_buffers': [None, False], 2026-02-21T11:25:04.5785785Z 'range_num_stages': [1, 3], 2026-02-21T11:25:04.5785993Z 'range_unroll_factors': [2, 3], 2026-02-21T11:25:04.5786215Z 'range_warp_specializes': [], 2026-02-21T11:25:04.5786419Z 'waves_per_eu': 1} 2026-02-21T11:25:04.5801384Z [529s] Fitting surrogate: 189 points, 
189 targets 2026-02-21T11:25:05.5203144Z [530s] Generation 2 starting: 88 neighbors, 5 active search path(s) 2026-02-21T11:25:39.6176177Z [564s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 3], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:25:43.6066609Z [568s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:25:43.9458873Z [568s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:25:44.6402464Z [569s] Timeout after 30s compiling Config(block_sizes=[1, 256, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], 
range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:25:44.6426611Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 0.8 configs/s 2026-02-21T11:26:01.8498940Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 91/91 5.3 configs/s 2026-02-21T11:26:02.1636623Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 29/29 - configs/s 2026-02-21T11:26:09.9604388Z [594s] Generation 2 complete: 2026-02-21T11:26:09.9605454Z error=3 2026-02-21T11:26:09.9605658Z timeout=4 2026-02-21T11:26:09.9605861Z ok=86 2026-02-21T11:26:09.9606049Z min=6.7719 2026-02-21T11:26:09.9606256Z mid=12.1696 2026-02-21T11:26:09.9606451Z max=75.6151 2026-02-21T11:26:09.9606836Z best={'block_sizes': [1, 128, 16], 2026-02-21T11:26:09.9607240Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T11:26:09.9607638Z 'l2_groupings': [64], 2026-02-21T11:26:09.9607910Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:26:09.9608236Z 'loop_orders': [[0, 1]], 2026-02-21T11:26:09.9608501Z 'matrix_instr_nonkdim': 16, 2026-02-21T11:26:09.9608757Z 'num_stages': 4, 2026-02-21T11:26:09.9608983Z 'num_warps': 4, 2026-02-21T11:26:09.9609213Z 'pid_type': 'flat', 2026-02-21T11:26:09.9609466Z 'range_flattens': [None, None], 2026-02-21T11:26:09.9609741Z 'range_multi_buffers': [None, True], 2026-02-21T11:26:09.9610052Z 'range_num_stages': [0, 4], 2026-02-21T11:26:09.9610315Z 'range_unroll_factors': [0, 4], 2026-02-21T11:26:09.9610600Z 'range_warp_specializes': [], 2026-02-21T11:26:09.9610870Z 'waves_per_eu': 2} 2026-02-21T11:26:09.9632626Z [594s] Fitting surrogate: 282 points, 282 targets 2026-02-21T11:26:11.5116247Z [596s] Generation 3 starting: 80 neighbors, 5 active search path(s) 2026-02-21T11:26:48.0444715Z [632s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], 
loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:26:48.2670704Z [632s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:26:49.6951747Z [634s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:26:50.0880143Z [634s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:26:50.0901107Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.9 configs/s 2026-02-21T11:27:10.5125322Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 81/81 3.9 
configs/s 2026-02-21T11:27:10.8083048Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 30/30 - configs/s 2026-02-21T11:27:18.5333234Z [663s] Generation 3 complete: 2026-02-21T11:27:18.5336006Z error=6 2026-02-21T11:27:18.5336323Z timeout=4 2026-02-21T11:27:18.5336545Z ok=75 2026-02-21T11:27:18.5336986Z min=6.6709 2026-02-21T11:27:18.5337232Z mid=13.1824 2026-02-21T11:27:18.5337460Z max=163.7117 2026-02-21T11:27:18.5340431Z best={'block_sizes': [1, 256, 32], 2026-02-21T11:27:18.5340954Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:27:18.5341368Z 'l2_groupings': [16], 2026-02-21T11:27:18.5342201Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:27:18.5342523Z 'loop_orders': [[0, 1]], 2026-02-21T11:27:18.5342812Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:27:18.5343245Z 'num_sm_multiplier': 8, 2026-02-21T11:27:18.5343515Z 'num_stages': 4, 2026-02-21T11:27:18.5343747Z 'num_warps': 8, 2026-02-21T11:27:18.5344020Z 'pid_type': 'persistent_blocked', 2026-02-21T11:27:18.5344333Z 'range_flattens': [None, True], 2026-02-21T11:27:18.5344590Z 'range_multi_buffers': [None, False], 2026-02-21T11:27:18.5344837Z 'range_num_stages': [1, 3], 2026-02-21T11:27:18.5345057Z 'range_unroll_factors': [3, 3], 2026-02-21T11:27:18.5345291Z 'range_warp_specializes': [], 2026-02-21T11:27:18.5345505Z 'waves_per_eu': 1} 2026-02-21T11:27:18.5360898Z [663s] Fitting surrogate: 367 points, 367 targets 2026-02-21T11:27:19.3621391Z [663s] Generation 4 starting: 72 neighbors, 5 active search path(s) 2026-02-21T11:27:53.2125402Z [697s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[3, 3], 
range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:27:55.6738950Z [700s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:27:55.8807216Z [700s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:27:57.2050761Z [701s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:27:58.3562673Z [702s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 4], range_unroll_factors=[3, 1], 
range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:27:58.3587239Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.5 configs/s 2026-02-21T11:28:09.2170982Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 73/73 6.8 configs/s 2026-02-21T11:28:09.5326169Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 33/33 - configs/s 2026-02-21T11:28:18.3384737Z [722s] Generation 4 complete: 2026-02-21T11:28:18.3385058Z error=2 2026-02-21T11:28:18.3385233Z timeout=5 2026-02-21T11:28:18.3385396Z ok=70 2026-02-21T11:28:18.3385573Z min=6.2016 2026-02-21T11:28:18.3385766Z mid=8.3862 2026-02-21T11:28:18.3385926Z max=75.5474 2026-02-21T11:28:18.3386127Z best={'block_sizes': [1, 256, 32], 2026-02-21T11:28:18.3386501Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:28:18.3386855Z 'l2_groupings': [16], 2026-02-21T11:28:18.3387079Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:28:18.3387630Z 'loop_orders': [[0, 1]], 2026-02-21T11:28:18.3387859Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:28:18.3388101Z 'num_sm_multiplier': 8, 2026-02-21T11:28:18.3388315Z 'num_stages': 4, 2026-02-21T11:28:18.3388509Z 'num_warps': 8, 2026-02-21T11:28:18.3388735Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:28:18.3389011Z 'range_flattens': [None, True], 2026-02-21T11:28:18.3389263Z 'range_multi_buffers': [None, False], 2026-02-21T11:28:18.3389522Z 'range_num_stages': [1, 3], 2026-02-21T11:28:18.3389751Z 'range_unroll_factors': [3, 3], 2026-02-21T11:28:18.3389997Z 'range_warp_specializes': [], 2026-02-21T11:28:18.3390372Z 'waves_per_eu': 1} 2026-02-21T11:28:18.3418192Z [722s] Fitting surrogate: 444 points, 444 targets 2026-02-21T11:28:19.1005389Z [723s] Generation 5 starting: 67 neighbors, 4 active search path(s) 2026-02-21T11:28:54.3124062Z [758s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], 
loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:28:56.0005995Z [760s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:28:56.2370889Z [760s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:28:56.2390292Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 0.7 configs/s 2026-02-21T11:29:07.0291593Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 68/68 6.3 configs/s 2026-02-21T11:29:07.2975880Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s 2026-02-21T11:29:14.9157794Z [779s] Generation 5 complete: 2026-02-21T11:29:14.9158109Z error=7 2026-02-21T11:29:14.9160380Z timeout=3 2026-02-21T11:29:14.9160906Z ok=61 2026-02-21T11:29:14.9161047Z min=5.8681 2026-02-21T11:29:14.9161196Z mid=8.4334 2026-02-21T11:29:14.9161339Z max=107.0934 2026-02-21T11:29:14.9161519Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:29:14.9161834Z 
'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:29:14.9162136Z 'l2_groupings': [16], 2026-02-21T11:29:14.9162330Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:29:14.9162559Z 'loop_orders': [[0, 1]], 2026-02-21T11:29:14.9162927Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:29:14.9163131Z 'num_sm_multiplier': 8, 2026-02-21T11:29:14.9163318Z 'num_stages': 4, 2026-02-21T11:29:14.9163481Z 'num_warps': 8, 2026-02-21T11:29:14.9163667Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:29:14.9163900Z 'range_flattens': [None, None], 2026-02-21T11:29:14.9164122Z 'range_multi_buffers': [None, False], 2026-02-21T11:29:14.9164346Z 'range_num_stages': [1, 2], 2026-02-21T11:29:14.9164548Z 'range_unroll_factors': [3, 3], 2026-02-21T11:29:14.9164761Z 'range_warp_specializes': [], 2026-02-21T11:29:14.9164967Z 'waves_per_eu': 1} 2026-02-21T11:29:14.9189807Z [779s] Fitting surrogate: 515 points, 515 targets 2026-02-21T11:29:15.5914235Z [780s] Generation 6 starting: 66 neighbors, 4 active search path(s) 2026-02-21T11:29:50.0773860Z [814s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:29:51.1574239Z [815s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 
2026-02-21T11:29:51.1592585Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 0.6 configs/s 2026-02-21T11:30:03.5275135Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 67/67 5.4 configs/s 2026-02-21T11:30:03.7005072Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s 2026-02-21T11:30:08.5285647Z [833s] Generation 6 complete: 2026-02-21T11:30:08.5285966Z error=4 2026-02-21T11:30:08.5286127Z timeout=2 2026-02-21T11:30:08.5286281Z ok=64 2026-02-21T11:30:08.5286431Z min=5.8744 2026-02-21T11:30:08.5286589Z mid=15.0732 2026-02-21T11:30:08.5286747Z max=136.5611 2026-02-21T11:30:08.5286939Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:30:08.5287285Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:30:08.5287608Z 'l2_groupings': [16], 2026-02-21T11:30:08.5287846Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:30:08.5288088Z 'loop_orders': [[0, 1]], 2026-02-21T11:30:08.5288306Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:30:08.5288522Z 'num_sm_multiplier': 8, 2026-02-21T11:30:08.5288759Z 'num_stages': 4, 2026-02-21T11:30:08.5288938Z 'num_warps': 8, 2026-02-21T11:30:08.5289142Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:30:08.5289385Z 'range_flattens': [None, None], 2026-02-21T11:30:08.5289620Z 'range_multi_buffers': [False, False], 2026-02-21T11:30:08.5290142Z 'range_num_stages': [1, 2], 2026-02-21T11:30:08.5290359Z 'range_unroll_factors': [3, 3], 2026-02-21T11:30:08.5290592Z 'range_warp_specializes': [], 2026-02-21T11:30:08.5290804Z 'waves_per_eu': 2} 2026-02-21T11:30:08.5314865Z [833s] Fitting surrogate: 585 points, 585 targets 2026-02-21T11:30:09.1997517Z [833s] Generation 7 starting: 65 neighbors, 4 active search path(s) 2026-02-21T11:30:42.1807940Z [866s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], 
matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:30:47.8761511Z [872s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[3, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:30:48.3139412Z [872s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:30:48.3161349Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 0.6 configs/s 2026-02-21T11:31:02.7862714Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 67/67 4.6 configs/s 2026-02-21T11:31:02.9874931Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s 2026-02-21T11:31:08.6643822Z [893s] Generation 7 complete: 2026-02-21T11:31:08.6644258Z error=3 2026-02-21T11:31:08.6644478Z timeout=3 2026-02-21T11:31:08.6644685Z ok=63 2026-02-21T11:31:08.6645419Z min=5.8937 2026-02-21T11:31:08.6645628Z mid=13.1874 2026-02-21T11:31:08.6645838Z max=110.0586 2026-02-21T11:31:08.6646101Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:31:08.6646632Z 'indexing': 
['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:31:08.6647094Z 'l2_groupings': [16], 2026-02-21T11:31:08.6647391Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:31:08.6647736Z 'loop_orders': [[0, 1]], 2026-02-21T11:31:08.6648013Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:31:08.6648312Z 'num_sm_multiplier': 8, 2026-02-21T11:31:08.6648546Z 'num_stages': 4, 2026-02-21T11:31:08.6648748Z 'num_warps': 8, 2026-02-21T11:31:08.6648965Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:31:08.6649227Z 'range_flattens': [None, None], 2026-02-21T11:31:08.6649478Z 'range_multi_buffers': [False, False], 2026-02-21T11:31:08.6649733Z 'range_num_stages': [1, 1], 2026-02-21T11:31:08.6649971Z 'range_unroll_factors': [3, 3], 2026-02-21T11:31:08.6650214Z 'range_warp_specializes': [], 2026-02-21T11:31:08.6650454Z 'waves_per_eu': 2} 2026-02-21T11:31:08.6677397Z [893s] Fitting surrogate: 654 points, 654 targets 2026-02-21T11:31:09.3244715Z [893s] Generation 8 starting: 62 neighbors, 4 active search path(s) 2026-02-21T11:31:44.2029485Z [928s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:31:45.2382135Z [929s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[3, 1], 
range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:31:45.6986877Z [930s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:31:46.1649944Z [930s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:31:47.3389427Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 0.6 configs/s 2026-02-21T11:31:59.1618274Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 64/64 5.5 configs/s 2026-02-21T11:31:59.3424833Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s 2026-02-21T11:32:04.4088518Z [948s] Generation 8 complete: 2026-02-21T11:32:04.4088945Z error=5 2026-02-21T11:32:04.4089222Z timeout=4 2026-02-21T11:32:04.4089433Z ok=57 2026-02-21T11:32:04.4089633Z min=5.8528 2026-02-21T11:32:04.4089840Z mid=12.1601 2026-02-21T11:32:04.4090040Z max=87.8726 2026-02-21T11:32:04.4090276Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:32:04.4090697Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T11:32:04.4091110Z 'l2_groupings': [16], 2026-02-21T11:32:04.4091903Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:32:04.4092236Z 'loop_orders': [[0, 1]], 
2026-02-21T11:32:04.4092540Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:32:04.4092823Z 'num_sm_multiplier': 8, 2026-02-21T11:32:04.4093083Z 'num_stages': 4, 2026-02-21T11:32:04.4093316Z 'num_warps': 8, 2026-02-21T11:32:04.4093624Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:32:04.4093905Z 'range_flattens': [False, None], 2026-02-21T11:32:04.4094243Z 'range_multi_buffers': [False, False], 2026-02-21T11:32:04.4094675Z 'range_num_stages': [1, 1], 2026-02-21T11:32:04.4095058Z 'range_unroll_factors': [3, 3], 2026-02-21T11:32:04.4095481Z 'range_warp_specializes': [], 2026-02-21T11:32:04.4095848Z 'waves_per_eu': 2} 2026-02-21T11:32:04.4119471Z [948s] Fitting surrogate: 720 points, 720 targets 2026-02-21T11:32:05.1049211Z [949s] Generation 9 starting: 66 neighbors, 4 active search path(s) 2026-02-21T11:32:39.3461156Z [983s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:32:39.8180733Z [984s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:32:40.0957633Z [984s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', 
'', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:32:41.3959661Z [985s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:32:41.3979465Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 0.5 configs/s 2026-02-21T11:32:55.5907172Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 68/68 4.8 configs/s 2026-02-21T11:32:55.7418554Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s 2026-02-21T11:32:59.9719309Z [1004s] Generation 9 complete: 2026-02-21T11:32:59.9719664Z error=3 2026-02-21T11:32:59.9719999Z timeout=4 2026-02-21T11:32:59.9720341Z ok=63 2026-02-21T11:32:59.9721016Z min=5.8623 2026-02-21T11:32:59.9721300Z mid=14.7878 2026-02-21T11:32:59.9721523Z max=171.9621 2026-02-21T11:32:59.9721833Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:32:59.9722263Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T11:32:59.9722774Z 'l2_groupings': [16], 2026-02-21T11:32:59.9723034Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:32:59.9723326Z 'loop_orders': [[0, 1]], 2026-02-21T11:32:59.9723582Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:32:59.9723837Z 'num_sm_multiplier': 8, 2026-02-21T11:32:59.9724078Z 'num_stages': 4, 2026-02-21T11:32:59.9724288Z 'num_warps': 8, 2026-02-21T11:32:59.9724699Z 'pid_type': 
'persistent_interleaved', 2026-02-21T11:32:59.9724994Z 'range_flattens': [False, None], 2026-02-21T11:32:59.9725290Z 'range_multi_buffers': [False, False], 2026-02-21T11:32:59.9725586Z 'range_num_stages': [1, 1], 2026-02-21T11:32:59.9725841Z 'range_unroll_factors': [3, 3], 2026-02-21T11:32:59.9726121Z 'range_warp_specializes': [], 2026-02-21T11:32:59.9726365Z 'waves_per_eu': 2} 2026-02-21T11:32:59.9748140Z [1004s] Fitting surrogate: 790 points, 790 targets 2026-02-21T11:33:00.7117366Z [1005s] Generation 10 starting: 73 neighbors, 4 active search path(s) 2026-02-21T11:33:33.9087155Z [1038s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:33:34.1868315Z [1038s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:33:34.1887764Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 1.0 configs/s 2026-02-21T11:33:47.0945702Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 5.8 configs/s 2026-02-21T11:33:47.3314895Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s 2026-02-21T11:33:54.0437617Z [1058s] Generation 10 complete: 2026-02-21T11:33:54.0438047Z error=1 
2026-02-21T11:33:54.0438333Z timeout=2 2026-02-21T11:33:54.0438538Z ok=74 2026-02-21T11:33:54.0438740Z min=5.7939 2026-02-21T11:33:54.0438946Z mid=10.3790 2026-02-21T11:33:54.0439771Z max=65.0460 2026-02-21T11:33:54.0440047Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:33:54.0440462Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T11:33:54.0441004Z 'l2_groupings': [16], 2026-02-21T11:33:54.0441283Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:33:54.0441601Z 'loop_orders': [[0, 1]], 2026-02-21T11:33:54.0441874Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:33:54.0442153Z 'num_sm_multiplier': 8, 2026-02-21T11:33:54.0442435Z 'num_stages': 4, 2026-02-21T11:33:54.0442794Z 'num_warps': 8, 2026-02-21T11:33:54.0443053Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:33:54.0443379Z 'range_flattens': [False, None], 2026-02-21T11:33:54.0443684Z 'range_multi_buffers': [None, True], 2026-02-21T11:33:54.0443988Z 'range_num_stages': [1, 1], 2026-02-21T11:33:54.0444273Z 'range_unroll_factors': [3, 3], 2026-02-21T11:33:54.0444574Z 'range_warp_specializes': [], 2026-02-21T11:33:54.0444851Z 'waves_per_eu': 2} 2026-02-21T11:33:54.0473073Z [1058s] Fitting surrogate: 867 points, 867 targets 2026-02-21T11:33:54.8541203Z [1059s] Generation 11 starting: 73 neighbors, 4 active search path(s) 2026-02-21T11:34:29.4836223Z [1093s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:34:32.0071633Z [1096s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], 
load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:34:33.0563717Z [1097s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:34:34.4338718Z [1098s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:34:34.4360959Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.5 configs/s 2026-02-21T11:34:45.4709007Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 6.7 configs/s 2026-02-21T11:34:45.7447264Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:34:53.6450982Z [1118s] Generation 11 complete: 2026-02-21T11:34:53.6451648Z error=3 2026-02-21T11:34:53.6451771Z timeout=4 2026-02-21T11:34:53.6452898Z ok=70 2026-02-21T11:34:53.6453955Z min=5.7165 2026-02-21T11:34:53.6454113Z mid=7.4070 2026-02-21T11:34:53.6454199Z max=59.7397 2026-02-21T11:34:53.6454307Z 
best={'block_sizes': [1, 256, 64], 2026-02-21T11:34:53.6454513Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T11:34:53.6454713Z 'l2_groupings': [16], 2026-02-21T11:34:53.6454819Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:34:53.6454974Z 'loop_orders': [[0, 1]], 2026-02-21T11:34:53.6455078Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:34:53.6455186Z 'num_sm_multiplier': 8, 2026-02-21T11:34:53.6455283Z 'num_stages': 4, 2026-02-21T11:34:53.6455604Z 'num_warps': 8, 2026-02-21T11:34:53.6455708Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:34:53.6455830Z 'range_flattens': [False, None], 2026-02-21T11:34:53.6455948Z 'range_multi_buffers': [None, True], 2026-02-21T11:34:53.6456062Z 'range_num_stages': [2, 1], 2026-02-21T11:34:53.6456184Z 'range_unroll_factors': [3, 2], 2026-02-21T11:34:53.6456331Z 'range_warp_specializes': [], 2026-02-21T11:34:53.6456435Z 'waves_per_eu': 2} 2026-02-21T11:34:53.6495288Z [1118s] Fitting surrogate: 944 points, 944 targets 2026-02-21T11:34:55.2004244Z [1119s] Generation 12 starting: 71 neighbors, 4 active search path(s) 2026-02-21T11:35:36.6685586Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.5 configs/s 2026-02-21T11:35:49.8047757Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 5.5 configs/s 2026-02-21T11:35:50.1326124Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:35:59.6248198Z [1184s] Generation 12 complete: 2026-02-21T11:35:59.6248643Z error=4 2026-02-21T11:35:59.6248856Z ok=71 2026-02-21T11:35:59.6249072Z min=5.6851 2026-02-21T11:35:59.6249280Z mid=6.1906 2026-02-21T11:35:59.6249494Z max=139.4411 2026-02-21T11:35:59.6249752Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:35:59.6250201Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T11:35:59.6250612Z 'l2_groupings': [16], 2026-02-21T11:35:59.6250892Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:35:59.6251226Z 'loop_orders': 
[[0, 1]], 2026-02-21T11:35:59.6251512Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:35:59.6251801Z 'num_sm_multiplier': 8, 2026-02-21T11:35:59.6252080Z 'num_stages': 4, 2026-02-21T11:35:59.6252357Z 'num_warps': 8, 2026-02-21T11:35:59.6252629Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:35:59.6252971Z 'range_flattens': [None, None], 2026-02-21T11:35:59.6253281Z 'range_multi_buffers': [None, True], 2026-02-21T11:35:59.6253588Z 'range_num_stages': [2, 1], 2026-02-21T11:35:59.6253885Z 'range_unroll_factors': [3, 2], 2026-02-21T11:35:59.6254175Z 'range_warp_specializes': [], 2026-02-21T11:35:59.6254461Z 'waves_per_eu': 2} 2026-02-21T11:35:59.6292651Z [1184s] Fitting surrogate: 1019 points, 1019 targets 2026-02-21T11:36:00.4012661Z [1184s] Generation 13 starting: 74 neighbors, 4 active search path(s) 2026-02-21T11:36:23.8029101Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 1.7 configs/s 2026-02-21T11:36:34.0408465Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 7.4 configs/s 2026-02-21T11:36:34.3268795Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:36:42.6111208Z [1227s] Generation 13 complete: 2026-02-21T11:36:42.6111533Z error=10 2026-02-21T11:36:42.6111683Z ok=68 2026-02-21T11:36:42.6112257Z min=5.7041 2026-02-21T11:36:42.6112404Z mid=7.1150 2026-02-21T11:36:42.6112555Z max=61.7277 2026-02-21T11:36:42.6112740Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:36:42.6113053Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T11:36:42.6113347Z 'l2_groupings': [16], 2026-02-21T11:36:42.6113544Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:36:42.6113777Z 'loop_orders': [[0, 1]], 2026-02-21T11:36:42.6113976Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:36:42.6114306Z 'num_sm_multiplier': 8, 2026-02-21T11:36:42.6114493Z 'num_stages': 4, 2026-02-21T11:36:42.6114663Z 'num_warps': 8, 2026-02-21T11:36:42.6114848Z 'pid_type': 'persistent_interleaved', 
2026-02-21T11:36:42.6115084Z 'range_flattens': [None, None], 2026-02-21T11:36:42.6115306Z 'range_multi_buffers': [None, True], 2026-02-21T11:36:42.6115523Z 'range_num_stages': [2, 1], 2026-02-21T11:36:42.6115744Z 'range_unroll_factors': [4, 2], 2026-02-21T11:36:42.6115956Z 'range_warp_specializes': [], 2026-02-21T11:36:42.6116161Z 'waves_per_eu': 2} 2026-02-21T11:36:42.6152155Z [1227s] Fitting surrogate: 1097 points, 1097 targets 2026-02-21T11:36:43.4076660Z [1227s] Generation 14 starting: 77 neighbors, 4 active search path(s) 2026-02-21T11:37:18.3023313Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 1.0 configs/s 2026-02-21T11:37:30.3250760Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 6.5 configs/s 2026-02-21T11:37:30.6301529Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:37:39.3687261Z [1283s] Generation 14 complete: 2026-02-21T11:37:39.3687662Z error=2 2026-02-21T11:37:39.3687881Z ok=79 2026-02-21T11:37:39.3688100Z min=5.6552 2026-02-21T11:37:39.3688309Z mid=7.1259 2026-02-21T11:37:39.3688515Z max=68.7424 2026-02-21T11:37:39.3688748Z best={'block_sizes': [1, 256, 64], 2026-02-21T11:37:39.3689594Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T11:37:39.3690004Z 'l2_groupings': [16], 2026-02-21T11:37:39.3690307Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:37:39.3690622Z 'loop_orders': [[0, 1]], 2026-02-21T11:37:39.3690904Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:37:39.3691184Z 'num_sm_multiplier': 8, 2026-02-21T11:37:39.3691463Z 'num_stages': 4, 2026-02-21T11:37:39.3691704Z 'num_warps': 8, 2026-02-21T11:37:39.3691964Z 'pid_type': 'persistent_interleaved', 2026-02-21T11:37:39.3692295Z 'range_flattens': [None, None], 2026-02-21T11:37:39.3692594Z 'range_multi_buffers': [None, True], 2026-02-21T11:37:39.3692916Z 'range_num_stages': [2, 1], 2026-02-21T11:37:39.3693193Z 'range_unroll_factors': [4, 2], 2026-02-21T11:37:39.3693410Z 
'range_warp_specializes': [], 2026-02-21T11:37:39.3693515Z 'waves_per_eu': 2} 2026-02-21T11:37:39.3734387Z [1283s] Fitting surrogate: 1178 points, 1178 targets 2026-02-21T11:37:40.2469318Z [1284s] Generation 15 starting: 76 neighbors, 4 active search path(s) 2026-02-21T11:38:13.4800530Z [1317s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[2, 1], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:38:13.9969503Z [1318s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:38:21.9646644Z [1326s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:38:21.9662646Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 1.0 configs/s 2026-02-21T11:38:31.2676964Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 8.1 configs/s 
2026-02-21T11:38:31.5864855Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:38:40.6860305Z [1345s] Generation 15 complete: 2026-02-21T11:38:40.6860515Z error=4 2026-02-21T11:38:40.6860609Z timeout=3 2026-02-21T11:38:40.6860690Z ok=73 2026-02-21T11:38:40.6860825Z min=5.5934 2026-02-21T11:38:40.6860904Z mid=5.8480 2026-02-21T11:38:40.6860991Z max=73.9800 2026-02-21T11:38:40.6861094Z best={'block_sizes': [1, 64, 32], 2026-02-21T11:38:40.6861257Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:38:40.6861409Z 'l2_groupings': [8], 2026-02-21T11:38:40.6862079Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:38:40.6862201Z 'loop_orders': [[0, 1]], 2026-02-21T11:38:40.6862313Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:38:40.6862421Z 'num_sm_multiplier': 128, 2026-02-21T11:38:40.6862525Z 'num_stages': 4, 2026-02-21T11:38:40.6862631Z 'num_warps': 2, 2026-02-21T11:38:40.6862728Z 'pid_type': 'persistent_blocked', 2026-02-21T11:38:40.6862850Z 'range_flattens': [None, None], 2026-02-21T11:38:40.6862964Z 'range_multi_buffers': [False, False], 2026-02-21T11:38:40.6863085Z 'range_num_stages': [1, 1], 2026-02-21T11:38:40.6863189Z 'range_unroll_factors': [4, 4], 2026-02-21T11:38:40.6863306Z 'range_warp_specializes': [], 2026-02-21T11:38:40.6863526Z 'waves_per_eu': 2} 2026-02-21T11:38:40.6908653Z [1345s] Fitting surrogate: 1258 points, 1258 targets 2026-02-21T11:38:41.1490241Z [1345s] Generation 16 starting: 39 neighbors, 2 active search path(s) 2026-02-21T11:39:12.2084057Z [1376s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[4, 2], 
range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:39:16.7610149Z [1381s] Timeout after 30s compiling Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:39:16.7631322Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 0.9 configs/s 2026-02-21T11:39:22.0606233Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 7.3 configs/s 2026-02-21T11:39:22.2055874Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:39:26.2583551Z [1390s] Generation 16 complete: 2026-02-21T11:39:26.2584321Z error=5 2026-02-21T11:39:26.2584479Z timeout=2 2026-02-21T11:39:26.2584637Z ok=34 2026-02-21T11:39:26.2584787Z min=5.6074 2026-02-21T11:39:26.2584944Z mid=5.8169 2026-02-21T11:39:26.2585094Z max=65.2135 2026-02-21T11:39:26.2585276Z best={'block_sizes': [1, 64, 32], 2026-02-21T11:39:26.2585578Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:39:26.2585894Z 'l2_groupings': [8], 2026-02-21T11:39:26.2586102Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:39:26.2586347Z 'loop_orders': [[0, 1]], 2026-02-21T11:39:26.2586701Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:39:26.2586911Z 'num_sm_multiplier': 128, 2026-02-21T11:39:26.2587115Z 'num_stages': 3, 2026-02-21T11:39:26.2587284Z 'num_warps': 2, 2026-02-21T11:39:26.2587494Z 'pid_type': 'persistent_blocked', 2026-02-21T11:39:26.2587726Z 'range_flattens': [None, None], 2026-02-21T11:39:26.2587953Z 'range_multi_buffers': [False, False], 2026-02-21T11:39:26.2588187Z 'range_num_stages': [1, 1], 2026-02-21T11:39:26.2588400Z 'range_unroll_factors': [4, 
4], 2026-02-21T11:39:26.2588619Z 'range_warp_specializes': [], 2026-02-21T11:39:26.2588829Z 'waves_per_eu': 2} 2026-02-21T11:39:26.2625268Z [1390s] Fitting surrogate: 1299 points, 1299 targets 2026-02-21T11:39:26.5620425Z [1391s] Generation 17 starting: 20 neighbors, 1 active search path(s) 2026-02-21T11:39:50.7786813Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.9 configs/s 2026-02-21T11:39:54.4052513Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 5.6 configs/s 2026-02-21T11:39:54.4984884Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:39:57.0845453Z [1421s] Generation 17 complete: 2026-02-21T11:39:57.0849145Z ok=21 2026-02-21T11:39:57.0850075Z min=5.6250 2026-02-21T11:39:57.0850301Z mid=7.2491 2026-02-21T11:39:57.0850523Z max=66.9534 2026-02-21T11:39:57.0850764Z best={'block_sizes': [1, 64, 32], 2026-02-21T11:39:57.0851222Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:39:57.0851663Z 'l2_groupings': [8], 2026-02-21T11:39:57.0851949Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:39:57.0852273Z 'loop_orders': [[0, 1]], 2026-02-21T11:39:57.0852543Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:39:57.0852831Z 'num_sm_multiplier': 128, 2026-02-21T11:39:57.0853098Z 'num_stages': 3, 2026-02-21T11:39:57.0853321Z 'num_warps': 2, 2026-02-21T11:39:57.0853736Z 'pid_type': 'persistent_blocked', 2026-02-21T11:39:57.0854058Z 'range_flattens': [None, None], 2026-02-21T11:39:57.0854386Z 'range_multi_buffers': [False, None], 2026-02-21T11:39:57.0854697Z 'range_num_stages': [1, 1], 2026-02-21T11:39:57.0854987Z 'range_unroll_factors': [4, 3], 2026-02-21T11:39:57.0855291Z 'range_warp_specializes': [], 2026-02-21T11:39:57.0855563Z 'waves_per_eu': 2} 2026-02-21T11:39:57.0886034Z [1421s] Fitting surrogate: 1320 points, 1320 targets 2026-02-21T11:39:57.4187736Z [1421s] Generation 18 starting: 20 neighbors, 1 active search path(s) 2026-02-21T11:40:10.4873386Z 
Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 2.0 configs/s 2026-02-21T11:40:13.7086928Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 6.3 configs/s 2026-02-21T11:40:13.7915489Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 2026-02-21T11:40:16.1193156Z [1440s] Generation 18 complete: 2026-02-21T11:40:16.1193617Z ok=21 2026-02-21T11:40:16.1193891Z min=5.5970 2026-02-21T11:40:16.1194099Z mid=7.5278 2026-02-21T11:40:16.1194303Z max=67.0089 2026-02-21T11:40:16.1195141Z best={'block_sizes': [1, 64, 32], 2026-02-21T11:40:16.1195556Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:40:16.1195957Z 'l2_groupings': [8], 2026-02-21T11:40:16.1196252Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:40:16.1196568Z 'loop_orders': [[0, 1]], 2026-02-21T11:40:16.1196846Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:40:16.1197129Z 'num_sm_multiplier': 64, 2026-02-21T11:40:16.1197386Z 'num_stages': 3, 2026-02-21T11:40:16.1197754Z 'num_warps': 2, 2026-02-21T11:40:16.1198013Z 'pid_type': 'persistent_blocked', 2026-02-21T11:40:16.1198329Z 'range_flattens': [None, None], 2026-02-21T11:40:16.1198632Z 'range_multi_buffers': [False, False], 2026-02-21T11:40:16.1198946Z 'range_num_stages': [1, 1], 2026-02-21T11:40:16.1199221Z 'range_unroll_factors': [4, 3], 2026-02-21T11:40:16.1199514Z 'range_warp_specializes': [], 2026-02-21T11:40:16.1199791Z 'waves_per_eu': 2} 2026-02-21T11:40:16.1237604Z [1440s] Fitting surrogate: 1341 points, 1341 targets 2026-02-21T11:40:16.4304717Z [1440s] Generation 19 starting: 18 neighbors, 1 active search path(s) 2026-02-21T11:40:31.3639025Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 1.5 configs/s 2026-02-21T11:40:33.4804988Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 8.9 configs/s 2026-02-21T11:40:33.5983758Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s 
2026-02-21T11:40:36.9386664Z [1461s] Generation 19 complete: 2026-02-21T11:40:36.9389251Z ok=19 2026-02-21T11:40:36.9389563Z min=5.6048 2026-02-21T11:40:36.9390085Z mid=5.6658 2026-02-21T11:40:36.9390349Z max=19.9376 2026-02-21T11:40:36.9390634Z best={'block_sizes': [1, 64, 32], 2026-02-21T11:40:36.9391064Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:40:36.9391466Z 'l2_groupings': [8], 2026-02-21T11:40:36.9392290Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:40:36.9392641Z 'loop_orders': [[0, 1]], 2026-02-21T11:40:36.9392916Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:40:36.9393212Z 'num_sm_multiplier': 64, 2026-02-21T11:40:36.9393479Z 'num_stages': 3, 2026-02-21T11:40:36.9393712Z 'num_warps': 2, 2026-02-21T11:40:36.9393973Z 'pid_type': 'persistent_blocked', 2026-02-21T11:40:36.9394303Z 'range_flattens': [None, None], 2026-02-21T11:40:36.9394606Z 'range_multi_buffers': [False, False], 2026-02-21T11:40:36.9394925Z 'range_num_stages': [1, 1], 2026-02-21T11:40:36.9395210Z 'range_unroll_factors': [3, 2], 2026-02-21T11:40:36.9395516Z 'range_warp_specializes': [], 2026-02-21T11:40:36.9395794Z 'waves_per_eu': 2} 2026-02-21T11:40:36.9431617Z [1461s] Fitting surrogate: 1360 points, 1360 targets 2026-02-21T11:40:37.2567677Z [1461s] Generation 20 starting: 20 neighbors, 1 active search path(s) 2026-02-21T11:40:46.7160710Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 1.9 configs/s 2026-02-21T11:40:49.6016614Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 7.4 configs/s 2026-02-21T11:40:49.6844841Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 - configs/s 2026-02-21T11:40:52.0010313Z [1476s] Generation 20 complete: 2026-02-21T11:40:52.0010541Z error=2 2026-02-21T11:40:52.0010657Z ok=19 2026-02-21T11:40:52.0010770Z min=5.5780 2026-02-21T11:40:52.0010847Z mid=5.9763 2026-02-21T11:40:52.0010927Z max=74.4706 2026-02-21T11:40:52.0011017Z best={'block_sizes': [1, 64, 32], 
2026-02-21T11:40:52.0011166Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T11:40:52.0011327Z 'l2_groupings': [8], 2026-02-21T11:40:52.0011430Z 'load_eviction_policies': ['', '', ''], 2026-02-21T11:40:52.0011549Z 'loop_orders': [[0, 1]], 2026-02-21T11:40:52.0011651Z 'matrix_instr_nonkdim': 32, 2026-02-21T11:40:52.0011757Z 'num_sm_multiplier': 128, 2026-02-21T11:40:52.0011855Z 'num_stages': 3, 2026-02-21T11:40:52.0011942Z 'num_warps': 2, 2026-02-21T11:40:52.0012044Z 'pid_type': 'persistent_blocked', 2026-02-21T11:40:52.0012159Z 'range_flattens': [None, None], 2026-02-21T11:40:52.0012271Z 'range_multi_buffers': [False, False], 2026-02-21T11:40:52.0012825Z 'range_num_stages': [1, 1], 2026-02-21T11:40:52.0012929Z 'range_unroll_factors': [3, 2], 2026-02-21T11:40:52.0013037Z 'range_warp_specializes': [], 2026-02-21T11:40:52.0013145Z 'waves_per_eu': 2} 2026-02-21T11:40:52.0064919Z [1476s] Fitting surrogate: 1381 points, 1381 targets 2026-02-21T11:40:52.1712119Z [1476s] Autotuning complete in 1476.7s after searching 1273 configs. 2026-02-21T11:40:52.1712328Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:40:52.1713292Z @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T11:40:52.1714003Z 2026-02-21T11:40:52.1714172Z [1476s] Code of selected kernel: /tmp/torchinductor_root/lq/clqqq6sh25ncvaga6kc7ssmw43ae45674ysypgbyxarb52rfaetm.py 2026-02-21T11:40:53.2268645Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T11:40:53.2271023Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T11:40:53.2273405Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T11:40:53.2273883Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T11:40:53.2274303Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T11:40:53.2274631Z ------------------------------------------ 2026-02-21T11:40:53.2274946Z (4, 48, 4096, 4096, 128) 2026-02-21T11:40:53.2275125Z 2026-02-21T11:40:53.2278151Z 83%|████████▎ | 5/6 [56:26<15:12, 912.21s/it]WARNING:tritonbench.utils.triton_op:Running input ID 6: 2026-02-21T11:40:53.2278719Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T11:40:53.2279072Z ------------------------------------------ 2026-02-21T11:40:53.2279385Z (4, 48, 8192, 8192, 128) 2026-02-21T11:40:53.2280996Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten 2026-02-21T11:40:55.1044292Z INFO:tritonbench.utils.triton_op:Took 1.51ms to get benchmark function for flex_attention 2026-02-21T11:40:56.9080211Z WARNING:__main__:Input tensor metadata: 2026-02-21T11:40:56.9080469Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T11:40:56.9080690Z 'dtype': 'torch.bfloat16', 2026-02-21T11:40:56.9080910Z 'shape': (4, 48, 8192, 128), 2026-02-21T11:40:56.9081139Z 'stride': (50331648, 1048576, 128, 1)}, 2026-02-21T11:40:56.9081366Z { 'device': 'cuda:0', 2026-02-21T11:40:56.9081561Z 'dtype': 'torch.bfloat16', 
2026-02-21T11:40:56.9081764Z 'shape': (4, 48, 8192, 128), 2026-02-21T11:40:56.9081974Z 'stride': (50331648, 1048576, 128, 1)}, 2026-02-21T11:40:56.9082669Z { 'device': 'cuda:0', 2026-02-21T11:40:56.9082866Z 'dtype': 'torch.bfloat16', 2026-02-21T11:40:56.9083078Z 'shape': (4, 48, 8192, 128), 2026-02-21T11:40:56.9083289Z 'stride': (50331648, 1048576, 128, 1)}), 2026-02-21T11:40:56.9083499Z 'kwargs': {}} 2026-02-21T11:40:56.9136637Z INFO:tritonbench.utils.triton_op:Took 6.06ms to get benchmark function for helion_attention 2026-02-21T11:40:57.1575150Z [0s] Autotune random seed: 2144140282 2026-02-21T11:40:57.4098462Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T11:41:30.6760529Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:41:30.9358227Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 16, 2048], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:31.4552729Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 1, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, 
pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:41:32.3764060Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 2, 8192], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:41:33.8171506Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 16, 2048], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:41:34.0972910Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[4, 0], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:41:34.6625390Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 4, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, None], 
range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:41:35.0302585Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 1, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:41:37.7973091Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[3, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:41:38.3975151Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T11:41:38.8752831Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], 
range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:41:40.7618844Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:41.2183102Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:41:41.4614437Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:41:41.7355548Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], 
range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:41:41.9487626Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 8, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:42.2679522Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 256, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:43.6329803Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:41:44.5807226Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 2, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[3, 3], 
range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:44.8622570Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 8192, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[1, 0], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T11:41:45.1242714Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:45.3997395Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:45.6188797Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 
0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T11:41:46.1978697Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:46.5733738Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 256, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T11:41:46.5754630Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.5 configs/s 2026-02-21T11:59:18.7145316Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:88:135: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T11:59:18.7146501Z v = tl.load(v_view + (indices_0[:, None, None] * 1048576 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T11:59:18.7147935Z ^ 2026-02-21T11:59:18.7149936Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:92:144: note: - use: %134 = "tt.reshape"(<>) : (tensor<1x16x128xbf16, #ttg.blocked<{sizePerThread = [1, 1, 4], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 8, 1], order = [2, 0, 1]}>>) -> tensor<16x128xbf16, #ttg.linear<{register = [[0, 1], [0, 2]], lane = [[0, 4], [0, 8], [0, 16], [0, 32], [0, 64], [1, 0]], warp = [[2, 0], [4, 0], [8, 0]], block = []}>> 2026-02-21T11:59:18.7151473Z 
2026-02-21T11:59:18.7152672Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T11:59:18.7153729Z ^ 2026-02-21T11:59:18.7154135Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T11:59:18.7154671Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}> 2026-02-21T11:59:18.7155517Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}> 2026-02-21T11:59:18.7156217Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:59:18.7156921Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:59:18.7157617Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:59:18.7158305Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T11:59:18.7158986Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T11:59:18.7159674Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T11:59:18.7160336Z #blocked8 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}> 2026-02-21T11:59:18.7160964Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}> 2026-02-21T11:59:18.7161520Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 
1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T11:59:18.7162012Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}> 2026-02-21T11:59:18.7162718Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}> 2026-02-21T11:59:18.7163206Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}> 2026-02-21T11:59:18.7163698Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}> 2026-02-21T11:59:18.7164189Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:59:18.7164781Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:59:18.7165270Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T11:59:18.7165797Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T11:59:18.7166598Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T11:59:18.7167231Z %c1048576_i32 = arith.constant 1048576 : i32 2026-02-21T11:59:18.7167443Z %c192_i64 = arith.constant 192 : i64 2026-02-21T11:59:18.7167640Z %c0_i64 = arith.constant 0 : i64 2026-02-21T11:59:18.7167836Z %c1048576_i64 = arith.constant 1048576 : i64 2026-02-21T11:59:18.7168126Z %cst = arith.constant dense<0.000000e+00> : tensor<1x4x128xbf16, #blocked> 
2026-02-21T11:59:18.7168448Z %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked1> 2026-02-21T11:59:18.7168731Z %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked1> 2026-02-21T11:59:18.7169018Z %cst_2 = arith.constant dense<8192> : tensor<1x4x1xi64, #blocked2> 2026-02-21T11:59:18.7169293Z %cst_3 = arith.constant dense<0> : tensor<1x4x1xi64, #blocked2> 2026-02-21T11:59:18.7169573Z %cst_4 = arith.constant dense<128> : tensor<1x4x1xi64, #blocked2> 2026-02-21T11:59:18.7169808Z %c16_i32 = arith.constant 16 : i32 2026-02-21T11:59:18.7169995Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T11:59:18.7170216Z %c1_i32 = arith.constant 1 : i32 2026-02-21T11:59:18.7170450Z %cst_5 = arith.constant dense<128> : tensor<1x16x1xi32, #blocked3> 2026-02-21T11:59:18.7170791Z %cst_6 = arith.constant dense<0.127517432> : tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7171075Z %cst_7 = arith.constant dense<0.127517432> : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7171309Z %cst_8 = arith.constant dense<0.000000e+00> : tensor<4x16xf32, #blocked6> 2026-02-21T11:59:18.7171539Z %cst_9 = arith.constant dense<128> : tensor<1x1x16xi32, #blocked7> 2026-02-21T11:59:18.7171725Z %c0_i32 = arith.constant 0 : i32 2026-02-21T11:59:18.7171908Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x4x128xf32, #blocked> 2026-02-21T11:59:18.7172147Z %cst_11 = arith.constant dense<1.000000e+00> : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7172384Z %cst_12 = arith.constant dense<0xFF800000> : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7172572Z %c4_i32 = arith.constant 4 : i32 2026-02-21T11:59:18.7172719Z %c192_i32 = arith.constant 192 : i32 2026-02-21T11:59:18.7172868Z %c393216_i32 = arith.constant 393216 : i32 2026-02-21T11:59:18.7173020Z %c21_i32 = arith.constant 21 : i32 2026-02-21T11:59:18.7173167Z %0 = tt.get_program_id x : i32 2026-02-21T11:59:18.7173297Z %1 = arith.muli %0, %c21_i32 : i32 2026-02-21T11:59:18.7173438Z %2 = arith.addi %1, %c21_i32 : i32 
2026-02-21T11:59:18.7173577Z %3 = arith.minsi %2, %c393216_i32 : i32 2026-02-21T11:59:18.7173777Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8> 2026-02-21T11:59:18.7174052Z %5 = tt.splat %arg0 : !tt.ptr -> tensor<1x4x128x!tt.ptr, #blocked> 2026-02-21T11:59:18.7174297Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked8> 2026-02-21T11:59:18.7174533Z %7 = arith.extsi %6 : tensor<4xi32, #blocked8> to tensor<4xi64, #blocked8> 2026-02-21T11:59:18.7174771Z %8 = arith.extsi %4 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8> 2026-02-21T11:59:18.7175091Z %9 = ttg.convert_layout %8 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:59:18.7175510Z %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9> 2026-02-21T11:59:18.7175875Z %11 = ttg.convert_layout %10 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked10> 2026-02-21T11:59:18.7176233Z %12 = ttg.convert_layout %11 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T11:59:18.7176655Z %13 = tt.expand_dims %12 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11> 2026-02-21T11:59:18.7177022Z %14 = ttg.convert_layout %13 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked1> 2026-02-21T11:59:18.7177323Z %15 = tt.broadcast %14 : tensor<1x1x128xi64, #blocked1> -> tensor<1x4x128xi64, #blocked1> 2026-02-21T11:59:18.7177618Z %16 = ttg.convert_layout %15 : tensor<1x4x128xi64, #blocked1> -> tensor<1x4x128xi64, #blocked> 2026-02-21T11:59:18.7177874Z %17 = arith.cmpi sge, %14, %cst_1 : tensor<1x1x128xi64, #blocked1> 2026-02-21T11:59:18.7178102Z %18 = arith.cmpi slt, %14, %cst_0 : tensor<1x1x128xi64, #blocked1> 2026-02-21T11:59:18.7178313Z %19 = arith.andi %17, %18 : tensor<1x1x128xi1, 
#blocked1> 2026-02-21T11:59:18.7178549Z %20 = tt.broadcast %19 : tensor<1x1x128xi1, #blocked1> -> tensor<1x4x128xi1, #blocked1> 2026-02-21T11:59:18.7178836Z %21 = ttg.convert_layout %20 : tensor<1x4x128xi1, #blocked1> -> tensor<1x4x128xi1, #blocked> 2026-02-21T11:59:18.7179109Z %22 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked8> 2026-02-21T11:59:18.7179422Z %23 = ttg.convert_layout %4 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:59:18.7179835Z %24 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9> 2026-02-21T11:59:18.7180183Z %25 = ttg.convert_layout %24 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10> 2026-02-21T11:59:18.7180540Z %26 = ttg.convert_layout %25 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T11:59:18.7180921Z %27 = tt.expand_dims %26 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi32, #blocked12> 2026-02-21T11:59:18.7181226Z %28 = ttg.convert_layout %27 : tensor<1x128x1xi32, #blocked12> -> tensor<1x128x1xi32, #blocked13> 2026-02-21T11:59:18.7181469Z %29 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x16x!tt.ptr, #blocked14> 2026-02-21T11:59:18.7181742Z %30 = ttg.convert_layout %25 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T11:59:18.7182087Z %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11> 2026-02-21T11:59:18.7182394Z %32 = ttg.convert_layout %31 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked1> 2026-02-21T11:59:18.7182649Z %33 = tt.broadcast %32 : tensor<1x1x128xi32, #blocked1> -> tensor<1x16x128xi32, #blocked1> 2026-02-21T11:59:18.7182900Z %34 = ttg.convert_layout %33 : tensor<1x16x128xi32, 
#blocked1> -> tensor<1x16x128xi32, #blocked> 2026-02-21T11:59:18.7183132Z %35 = tt.splat %arg2 : !tt.ptr -> tensor<1x16x128x!tt.ptr, #blocked> 2026-02-21T11:59:18.7183359Z %36 = tt.splat %arg3 : !tt.ptr -> tensor<1x4x128x!tt.ptr, #blocked> 2026-02-21T11:59:18.7183539Z scf.for %arg4 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T11:59:18.7183675Z %37 = arith.remsi %arg4, %c192_i32 : i32 2026-02-21T11:59:18.7183806Z %38 = arith.divsi %arg4, %c192_i32 : i32 2026-02-21T11:59:18.7183931Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T11:59:18.7184055Z %40 = arith.extsi %37 : i32 to i64 2026-02-21T11:59:18.7184195Z %41 = arith.extsi %39 : i32 to i64 2026-02-21T11:59:18.7184320Z %42 = arith.muli %40, %c1048576_i64 : i64 2026-02-21T11:59:18.7184491Z %43 = tt.splat %42 : i64 -> tensor<1x4x128xi64, #blocked> 2026-02-21T11:59:18.7184649Z %44 = tt.splat %41 : i64 -> tensor<4xi64, #blocked8> 2026-02-21T11:59:18.7184802Z %45 = arith.addi %44, %7 : tensor<4xi64, #blocked8> 2026-02-21T11:59:18.7185032Z %46 = ttg.convert_layout %45 : tensor<4xi64, #blocked8> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:59:18.7185354Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x4xi64, #blocked9> 2026-02-21T11:59:18.7185638Z %48 = ttg.convert_layout %47 : tensor<1x4xi64, #blocked9> -> tensor<1x4xi64, #blocked5> 2026-02-21T11:59:18.7185918Z %49 = ttg.convert_layout %48 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked15}>> 2026-02-21T11:59:18.7186254Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xi64, #blocked15> 2026-02-21T11:59:18.7186575Z %51 = ttg.convert_layout %50 : tensor<1x4x1xi64, #blocked15> -> tensor<1x4x1xi64, #blocked2> 2026-02-21T11:59:18.7186785Z %52 = arith.muli %51, %cst_4 : tensor<1x4x1xi64, #blocked2> 2026-02-21T11:59:18.7195378Z %53 = tt.broadcast %52 : tensor<1x4x1xi64, 
#blocked2> -> tensor<1x4x128xi64, #blocked2> 2026-02-21T11:59:18.7195633Z %54 = ttg.convert_layout %53 : tensor<1x4x128xi64, #blocked2> -> tensor<1x4x128xi64, #blocked> 2026-02-21T11:59:18.7195861Z %55 = arith.addi %54, %16 : tensor<1x4x128xi64, #blocked> 2026-02-21T11:59:18.7196026Z %56 = arith.addi %43, %55 : tensor<1x4x128xi64, #blocked> 2026-02-21T11:59:18.7196276Z %57 = tt.addptr %5, %56 : tensor<1x4x128x!tt.ptr, #blocked>, tensor<1x4x128xi64, #blocked> 2026-02-21T11:59:18.7196481Z %58 = arith.cmpi sge, %40, %c0_i64 : i64 2026-02-21T11:59:18.7196623Z %59 = arith.cmpi slt, %40, %c192_i64 : i64 2026-02-21T11:59:18.7196747Z %60 = arith.andi %58, %59 : i1 2026-02-21T11:59:18.7196896Z %61 = arith.cmpi sge, %51, %cst_3 : tensor<1x4x1xi64, #blocked2> 2026-02-21T11:59:18.7197070Z %62 = arith.cmpi slt, %51, %cst_2 : tensor<1x4x1xi64, #blocked2> 2026-02-21T11:59:18.7197241Z %63 = arith.andi %61, %62 : tensor<1x4x1xi1, #blocked2> 2026-02-21T11:59:18.7197397Z %64 = tt.splat %60 : i1 -> tensor<1x4x1xi1, #blocked2> 2026-02-21T11:59:18.7197555Z %65 = arith.andi %64, %63 : tensor<1x4x1xi1, #blocked2> 2026-02-21T11:59:18.7197750Z %66 = tt.broadcast %65 : tensor<1x4x1xi1, #blocked2> -> tensor<1x4x128xi1, #blocked2> 2026-02-21T11:59:18.7197988Z %67 = ttg.convert_layout %66 : tensor<1x4x128xi1, #blocked2> -> tensor<1x4x128xi1, #blocked> 2026-02-21T11:59:18.7198201Z %68 = arith.andi %67, %21 : tensor<1x4x128xi1, #blocked> 2026-02-21T11:59:18.7198368Z %69 = tt.load %57, %68, %cst : tensor<1x4x128x!tt.ptr, #blocked> 2026-02-21T11:59:18.7198531Z %70 = arith.muli %37, %c1048576_i32 : i32 2026-02-21T11:59:18.7198677Z %71 = tt.splat %70 : i32 -> tensor<1x128x1xi32, #blocked13> 2026-02-21T11:59:18.7198848Z %72 = arith.addi %71, %28 : tensor<1x128x1xi32, #blocked13> 2026-02-21T11:59:18.7199055Z %73 = tt.broadcast %72 : tensor<1x128x1xi32, #blocked13> -> tensor<1x128x16xi32, #blocked13> 2026-02-21T11:59:18.7199319Z %74 = ttg.convert_layout %73 : tensor<1x128x16xi32, #blocked13> -> 
tensor<1x128x16xi32, #blocked14> 2026-02-21T11:59:18.7199595Z %75 = tt.reshape %69 : tensor<1x4x128xbf16, #blocked> -> tensor<4x128xbf16, #blocked10> 2026-02-21T11:59:18.7199789Z %76 = tt.splat %70 : i32 -> tensor<1x16x1xi32, #blocked3> 2026-02-21T11:59:18.7200152Z %77:3 = scf.for %arg5 = %c0_i32 to %c8192_i32 step %c16_i32 iter_args(%arg6 = %cst_12, %arg7 = %cst_11, %arg8 = %cst_10) -> (tensor<1x4xf32, #blocked5>, tensor<1x4xf32, #blocked5>, tensor<1x4x128xf32, #blocked>) : i32 { 2026-02-21T11:59:18.7200537Z %86 = tt.splat %arg5 : i32 -> tensor<16xi32, #blocked8> 2026-02-21T11:59:18.7200690Z %87 = arith.addi %86, %22 : tensor<16xi32, #blocked8> 2026-02-21T11:59:18.7200926Z %88 = ttg.convert_layout %87 : tensor<16xi32, #blocked8> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> 2026-02-21T11:59:18.7201253Z %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x16xi32, #blocked9> 2026-02-21T11:59:18.7201545Z %90 = ttg.convert_layout %89 : tensor<1x16xi32, #blocked9> -> tensor<1x16xi32, #blocked6> 2026-02-21T11:59:18.7201831Z %91 = ttg.convert_layout %90 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked16}>> 2026-02-21T11:59:18.7202171Z %92 = tt.expand_dims %91 {axis = 1 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked16}>> -> tensor<1x1x16xi32, #blocked16> 2026-02-21T11:59:18.7202478Z %93 = ttg.convert_layout %92 : tensor<1x1x16xi32, #blocked16> -> tensor<1x1x16xi32, #blocked7> 2026-02-21T11:59:18.7202758Z %94 = arith.muli %93, %cst_9 : tensor<1x1x16xi32, #blocked7> 2026-02-21T11:59:18.7202979Z %95 = tt.broadcast %94 : tensor<1x1x16xi32, #blocked7> -> tensor<1x128x16xi32, #blocked7> 2026-02-21T11:59:18.7203236Z %96 = ttg.convert_layout %95 : tensor<1x128x16xi32, #blocked7> -> tensor<1x128x16xi32, #blocked14> 2026-02-21T11:59:18.7203455Z %97 = arith.addi %74, %96 : tensor<1x128x16xi32, #blocked14> 2026-02-21T11:59:18.7203681Z %98 = tt.addptr 
%29, %97 : tensor<1x128x16x!tt.ptr, #blocked14>, tensor<1x128x16xi32, #blocked14> 2026-02-21T11:59:18.7203903Z %99 = tt.load %98 : tensor<1x128x16x!tt.ptr, #blocked14> 2026-02-21T11:59:18.7204104Z %100 = tt.reshape %99 : tensor<1x128x16xbf16, #blocked14> -> tensor<128x16xbf16, #blocked6> 2026-02-21T11:59:18.7204433Z %101 = ttg.convert_layout %75 : tensor<4x128xbf16, #blocked10> -> tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> 2026-02-21T11:59:18.7204790Z %102 = ttg.convert_layout %100 : tensor<128x16xbf16, #blocked6> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T11:59:18.7205098Z %103 = ttg.convert_layout %cst_8 : tensor<4x16xf32, #blocked6> -> tensor<4x16xf32, #blocked6> 2026-02-21T11:59:18.7205511Z %104 = tt.dot %101, %102, %103, inputPrecision = tf32 : tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<4x16xf32, #blocked6> 2026-02-21T11:59:18.7205907Z %105 = tt.reshape %104 : tensor<4x16xf32, #blocked6> -> tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7206151Z %106 = arith.truncf %105 : tensor<1x4x16xf32, #blocked4> to tensor<1x4x16xbf16, #blocked4> 2026-02-21T11:59:18.7206397Z %107 = arith.extf %106 : tensor<1x4x16xbf16, #blocked4> to tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7206593Z %108 = "tt.reduce"(%107) <{axis = 2 : i32}> ({ 2026-02-21T11:59:18.7206729Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T11:59:18.7206854Z %160 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T11:59:18.7206988Z tt.reduce.return %160 : f32 2026-02-21T11:59:18.7207181Z }) : (tensor<1x4x16xf32, #blocked4>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>> 2026-02-21T11:59:18.7207477Z %109 = ttg.convert_layout %108 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>> -> tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7207778Z %110 = arith.truncf %109 : tensor<1x4xf32, #blocked5> to tensor<1x4xbf16, #blocked5> 
2026-02-21T11:59:18.7208008Z %111 = arith.extf %110 : tensor<1x4xbf16, #blocked5> to tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7208212Z %112 = arith.mulf %111, %cst_7 : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7208411Z %113 = arith.truncf %112 : tensor<1x4xf32, #blocked5> to tensor<1x4xbf16, #blocked5> 2026-02-21T11:59:18.7208657Z %114 = arith.extf %113 : tensor<1x4xbf16, #blocked5> to tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7208856Z %115 = arith.cmpf ogt, %arg6, %114 : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7209036Z %116 = arith.cmpf une, %arg6, %arg6 : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7209211Z %117 = arith.ori %115, %116 : tensor<1x4xi1, #blocked5> 2026-02-21T11:59:18.7209407Z %118 = arith.select %117, %arg6, %114 : tensor<1x4xi1, #blocked5>, tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7209620Z %119 = arith.mulf %107, %cst_6 : tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7209825Z %120 = arith.truncf %119 : tensor<1x4x16xf32, #blocked4> to tensor<1x4x16xbf16, #blocked4> 2026-02-21T11:59:18.7210117Z %121 = ttg.convert_layout %118 : tensor<1x4xf32, #blocked5> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> 2026-02-21T11:59:18.7210467Z %122 = tt.expand_dims %121 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xf32, #blocked15> 2026-02-21T11:59:18.7210773Z %123 = ttg.convert_layout %122 : tensor<1x4x1xf32, #blocked15> -> tensor<1x4x1xf32, #blocked2> 2026-02-21T11:59:18.7211044Z %124 = arith.extf %120 : tensor<1x4x16xbf16, #blocked4> to tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7211279Z %125 = tt.broadcast %123 : tensor<1x4x1xf32, #blocked2> -> tensor<1x4x16xf32, #blocked2> 2026-02-21T11:59:18.7211530Z %126 = ttg.convert_layout %125 : tensor<1x4x16xf32, #blocked2> -> tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7211746Z %127 = arith.subf %124, %126 : tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7212052Z %128 = tt.extern_elementwise %127 
{libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4x16xf32, #blocked4>) -> tensor<1x4x16xf32, #blocked4> 2026-02-21T11:59:18.7212364Z %129 = "tt.reduce"(%128) <{axis = 2 : i32}> ({ 2026-02-21T11:59:18.7212494Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T11:59:18.7212620Z %160 = arith.addf %arg9, %arg10 : f32 2026-02-21T11:59:18.7212745Z tt.reduce.return %160 : f32 2026-02-21T11:59:18.7212939Z }) : (tensor<1x4x16xf32, #blocked4>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>> 2026-02-21T11:59:18.7213234Z %130 = ttg.convert_layout %129 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>> -> tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7213480Z %131 = arith.subf %arg6, %118 : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7213772Z %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4xf32, #blocked5>) -> tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7214064Z %133 = arith.mulf %arg7, %132 : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7214226Z %134 = arith.addf %133, %130 : tensor<1x4xf32, #blocked5> 2026-02-21T11:59:18.7214472Z %135 = ttg.convert_layout %132 : tensor<1x4xf32, #blocked5> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> 2026-02-21T11:59:18.7214816Z %136 = tt.expand_dims %135 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xf32, #blocked15> 2026-02-21T11:59:18.7215122Z %137 = ttg.convert_layout %136 : tensor<1x4x1xf32, #blocked15> -> tensor<1x4x1xf32, #blocked2> 2026-02-21T11:59:18.7215373Z %138 = tt.broadcast %137 : tensor<1x4x1xf32, #blocked2> -> tensor<1x4x128xf32, #blocked2> 2026-02-21T11:59:18.7215641Z %139 = ttg.convert_layout %138 : tensor<1x4x128xf32, #blocked2> -> tensor<1x4x128xf32, #blocked> 2026-02-21T11:59:18.7215860Z %140 = arith.mulf %arg8, %139 : tensor<1x4x128xf32, #blocked> 2026-02-21T11:59:18.7216114Z %141 = ttg.convert_layout %90 : tensor<1x16xi32, 
#blocked6> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> 2026-02-21T11:59:18.7216465Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x16x1xi32, #blocked17> 2026-02-21T11:59:18.7216796Z %143 = ttg.convert_layout %142 : tensor<1x16x1xi32, #blocked17> -> tensor<1x16x1xi32, #blocked3> 2026-02-21T11:59:18.7217014Z %144 = arith.muli %143, %cst_5 : tensor<1x16x1xi32, #blocked3> 2026-02-21T11:59:18.7217188Z %145 = arith.addi %76, %144 : tensor<1x16x1xi32, #blocked3> 2026-02-21T11:59:18.7217390Z %146 = tt.broadcast %145 : tensor<1x16x1xi32, #blocked3> -> tensor<1x16x128xi32, #blocked3> 2026-02-21T11:59:18.7217652Z %147 = ttg.convert_layout %146 : tensor<1x16x128xi32, #blocked3> -> tensor<1x16x128xi32, #blocked> 2026-02-21T11:59:18.7217871Z %148 = arith.addi %147, %34 : tensor<1x16x128xi32, #blocked> 2026-02-21T11:59:18.7218086Z %149 = tt.addptr %35, %148 : tensor<1x16x128x!tt.ptr, #blocked>, tensor<1x16x128xi32, #blocked> 2026-02-21T11:59:18.7218307Z %150 = tt.load %149 : tensor<1x16x128x!tt.ptr, #blocked> 2026-02-21T11:59:18.7218513Z %151 = arith.truncf %128 : tensor<1x4x16xf32, #blocked4> to tensor<1x4x16xbf16, #blocked4> 2026-02-21T11:59:18.7226940Z %152 = tt.reshape %140 : tensor<1x4x128xf32, #blocked> -> tensor<4x128xf32, #blocked10> 2026-02-21T11:59:18.7227185Z %153 = tt.reshape %151 : tensor<1x4x16xbf16, #blocked4> -> tensor<4x16xbf16, #blocked6> 2026-02-21T11:59:18.7227429Z %154 = tt.reshape %150 : tensor<1x16x128xbf16, #blocked> -> tensor<16x128xbf16, #blocked10> 2026-02-21T11:59:18.7227737Z %155 = ttg.convert_layout %153 : tensor<4x16xbf16, #blocked6> -> tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> 2026-02-21T11:59:18.7228096Z %156 = ttg.convert_layout %154 : tensor<16x128xbf16, #blocked10> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> 2026-02-21T11:59:18.7228434Z %157 = ttg.convert_layout %152 : tensor<4x128xf32, #blocked10> -> 
tensor<4x128xf32, #blocked10> 2026-02-21T11:59:18.7228864Z %158 = tt.dot %155, %156, %157, inputPrecision = tf32 : tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<4x128xf32, #blocked10> 2026-02-21T11:59:18.7229265Z %159 = tt.reshape %158 : tensor<4x128xf32, #blocked10> -> tensor<1x4x128xf32, #blocked> 2026-02-21T11:59:18.7229539Z scf.yield %118, %134, %159 : tensor<1x4xf32, #blocked5>, tensor<1x4xf32, #blocked5>, tensor<1x4x128xf32, #blocked> 2026-02-21T11:59:18.7229759Z } {tt.num_stages = 4 : i32} 2026-02-21T11:59:18.7229979Z %78 = ttg.convert_layout %77#1 : tensor<1x4xf32, #blocked5> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> 2026-02-21T11:59:18.7230323Z %79 = tt.expand_dims %78 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xf32, #blocked15> 2026-02-21T11:59:18.7230620Z %80 = ttg.convert_layout %79 : tensor<1x4x1xf32, #blocked15> -> tensor<1x4x1xf32, #blocked2> 2026-02-21T11:59:18.7230864Z %81 = tt.broadcast %80 : tensor<1x4x1xf32, #blocked2> -> tensor<1x4x128xf32, #blocked2> 2026-02-21T11:59:18.7231111Z %82 = ttg.convert_layout %81 : tensor<1x4x128xf32, #blocked2> -> tensor<1x4x128xf32, #blocked> 2026-02-21T11:59:18.7231323Z %83 = arith.divf %77#2, %82 : tensor<1x4x128xf32, #blocked> 2026-02-21T11:59:18.7231526Z %84 = arith.truncf %83 : tensor<1x4x128xf32, #blocked> to tensor<1x4x128xbf16, #blocked> 2026-02-21T11:59:18.7231771Z %85 = tt.addptr %36, %56 : tensor<1x4x128x!tt.ptr, #blocked>, tensor<1x4x128xi64, #blocked> 2026-02-21T11:59:18.7232003Z tt.store %85, %84, %68 : tensor<1x4x128x!tt.ptr, #blocked> 2026-02-21T11:59:18.7232203Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32} 2026-02-21T11:59:18.7232368Z tt.return 2026-02-21T11:59:18.7232455Z } 2026-02-21T11:59:18.7232535Z } 2026-02-21T11:59:18.7232582Z 2026-02-21T11:59:18.7232620Z {-# 2026-02-21T11:59:18.7232702Z 
external_resources: { 2026-02-21T11:59:18.7232836Z mlir_reproducer: { 2026-02-21T11:59:18.7235080Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T11:59:18.7237383Z disable_threading: false, 2026-02-21T11:59:18.7237502Z verify_each: true 2026-02-21T11:59:18.7237596Z } 2026-02-21T11:59:18.7237674Z } 2026-02-21T11:59:18.7237744Z #-} 2026-02-21T11:59:18.7238025Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T11:59:18.7238755Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:18:0: note: Pipeline failed while executing 
[`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T11:59:18.7239311Z [1101s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T11:59:18.7240106Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 4, 16], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T11:59:18.7240831Z Error: RuntimeError: PassManager::run failed 2026-02-21T11:59:18.7241006Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T12:00:16.9520987Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.3 configs/s 2026-02-21T12:00:16.9532676Z [1159s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s]) 2026-02-21T12:00:17.0246768Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 - configs/s 2026-02-21T12:00:17.4295027Z [1160s] Initial random population of 100, 5 starting points: 2026-02-21T12:00:17.4298417Z error=19 2026-02-21T12:00:17.4298741Z timeout=25 2026-02-21T12:00:17.4298945Z ok=56 2026-02-21T12:00:17.4299136Z min=31.6083 2026-02-21T12:00:17.4299767Z mid=276.2712 2026-02-21T12:00:17.4299996Z max=56214.3750 2026-02-21T12:00:17.4300257Z best={'block_sizes': [1, 256, 16], 2026-02-21T12:00:17.4300684Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T12:00:17.4301087Z 'l2_groupings': [8], 2026-02-21T12:00:17.4301356Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:00:17.4301639Z 'loop_orders': [[0, 1]], 2026-02-21T12:00:17.4301889Z 
'matrix_instr_nonkdim': 16, 2026-02-21T12:00:17.4302122Z 'num_sm_multiplier': 16, 2026-02-21T12:00:17.4302478Z 'num_stages': 4, 2026-02-21T12:00:17.4302665Z 'num_warps': 16, 2026-02-21T12:00:17.4302885Z 'pid_type': 'persistent_blocked', 2026-02-21T12:00:17.4303156Z 'range_flattens': [False, True], 2026-02-21T12:00:17.4303429Z 'range_multi_buffers': [None, False], 2026-02-21T12:00:17.4303684Z 'range_num_stages': [1, 3], 2026-02-21T12:00:17.4303919Z 'range_unroll_factors': [2, 3], 2026-02-21T12:00:17.4304161Z 'range_warp_specializes': [], 2026-02-21T12:00:17.4304387Z 'waves_per_eu': 1} 2026-02-21T12:00:17.4309385Z [1160s] Fitting surrogate: 100 points, 100 targets 2026-02-21T12:00:18.2834306Z [1160s] Generation 1 starting: 82 neighbors, 5 active search path(s) 2026-02-21T12:00:39.0100903Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 2.1 configs/s 2026-02-21T12:01:44.0125525Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 82/82 1.5 configs/s 2026-02-21T12:01:44.7338082Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 7/7 - configs/s 2026-02-21T12:01:49.7223982Z [1252s] Generation 1 complete: 2026-02-21T12:01:49.7224357Z error=10 2026-02-21T12:01:49.7224566Z ok=77 2026-02-21T12:01:49.7224770Z min=25.8854 2026-02-21T12:01:49.7224974Z mid=64.4642 2026-02-21T12:01:49.7225597Z max=1690.6870 2026-02-21T12:01:49.7225846Z best={'block_sizes': [1, 256, 32], 2026-02-21T12:01:49.7226266Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T12:01:49.7226661Z 'l2_groupings': [8], 2026-02-21T12:01:49.7226959Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:01:49.7227291Z 'loop_orders': [[0, 1]], 2026-02-21T12:01:49.7227571Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:01:49.7227854Z 'num_sm_multiplier': 16, 2026-02-21T12:01:49.7228113Z 'num_stages': 4, 2026-02-21T12:01:49.7228343Z 'num_warps': 8, 2026-02-21T12:01:49.7228597Z 'pid_type': 'persistent_blocked', 2026-02-21T12:01:49.7228908Z 'range_flattens': 
[False, True], 2026-02-21T12:01:49.7229375Z 'range_multi_buffers': [True, False], 2026-02-21T12:01:49.7229691Z 'range_num_stages': [1, 3], 2026-02-21T12:01:49.7229982Z 'range_unroll_factors': [2, 3], 2026-02-21T12:01:49.7230286Z 'range_warp_specializes': [], 2026-02-21T12:01:49.7230557Z 'waves_per_eu': 1} 2026-02-21T12:01:49.7239075Z [1252s] Fitting surrogate: 187 points, 187 targets 2026-02-21T12:01:50.5941738Z [1253s] Generation 2 starting: 80 neighbors, 5 active search path(s) 2026-02-21T12:02:27.4604732Z [1290s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:02:29.4240209Z [1292s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:02:29.4265216Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 0.7 configs/s 2026-02-21T12:03:16.0525739Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 83/83 1.8 configs/s 2026-02-21T12:03:17.5014775Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 7/7 - configs/s 2026-02-21T12:03:27.5499759Z [1350s] Generation 2 complete: 2026-02-21T12:03:27.5500179Z error=3 2026-02-21T12:03:27.5500392Z timeout=2 2026-02-21T12:03:27.5500604Z ok=81 
2026-02-21T12:03:27.5500802Z min=25.8371 2026-02-21T12:03:27.5501056Z mid=31.6133 2026-02-21T12:03:27.5501251Z max=552.2924 2026-02-21T12:03:27.5501496Z best={'block_sizes': [1, 256, 32], 2026-02-21T12:03:27.5502424Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T12:03:27.5502831Z 'l2_groupings': [8], 2026-02-21T12:03:27.5503101Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:03:27.5503441Z 'loop_orders': [[0, 1]], 2026-02-21T12:03:27.5503713Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:03:27.5503997Z 'num_sm_multiplier': 16, 2026-02-21T12:03:27.5504258Z 'num_stages': 4, 2026-02-21T12:03:27.5504482Z 'num_warps': 8, 2026-02-21T12:03:27.5504755Z 'pid_type': 'persistent_blocked', 2026-02-21T12:03:27.5505058Z 'range_flattens': [False, True], 2026-02-21T12:03:27.5505364Z 'range_multi_buffers': [True, False], 2026-02-21T12:03:27.5505664Z 'range_num_stages': [1, 3], 2026-02-21T12:03:27.5505944Z 'range_unroll_factors': [2, 3], 2026-02-21T12:03:27.5506170Z 'range_warp_specializes': [], 2026-02-21T12:03:27.5506329Z 'waves_per_eu': 1} 2026-02-21T12:03:27.5519022Z [1350s] Fitting surrogate: 273 points, 273 targets 2026-02-21T12:03:28.4512333Z [1351s] Generation 3 starting: 77 neighbors, 5 active search path(s) 2026-02-21T12:03:53.4077472Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.8 configs/s 2026-02-21T12:04:39.8044797Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 79/79 1.6 configs/s 2026-02-21T12:04:41.0677143Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 7/7 - configs/s 2026-02-21T12:04:49.8138323Z [1432s] Generation 3 complete: 2026-02-21T12:04:49.8138743Z error=1 2026-02-21T12:04:49.8139000Z ok=82 2026-02-21T12:04:49.8139207Z min=25.6264 2026-02-21T12:04:49.8139427Z mid=36.8950 2026-02-21T12:04:49.8139629Z max=402.1087 2026-02-21T12:04:49.8140081Z best={'block_sizes': [1, 256, 32], 2026-02-21T12:04:49.8140521Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 
2026-02-21T12:04:49.8140922Z 'l2_groupings': [8], 2026-02-21T12:04:49.8141756Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:04:49.8142070Z 'loop_orders': [[0, 1]], 2026-02-21T12:04:49.8142356Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:04:49.8142660Z 'num_sm_multiplier': 16, 2026-02-21T12:04:49.8143015Z 'num_stages': 4, 2026-02-21T12:04:49.8143268Z 'num_warps': 8, 2026-02-21T12:04:49.8143522Z 'pid_type': 'persistent_blocked', 2026-02-21T12:04:49.8143958Z 'range_flattens': [False, True], 2026-02-21T12:04:49.8144291Z 'range_multi_buffers': [True, False], 2026-02-21T12:04:49.8144673Z 'range_num_stages': [1, 3], 2026-02-21T12:04:49.8144954Z 'range_unroll_factors': [2, 3], 2026-02-21T12:04:49.8145241Z 'range_warp_specializes': [], 2026-02-21T12:04:49.8145452Z 'waves_per_eu': 1} 2026-02-21T12:04:49.8160953Z [1432s] Fitting surrogate: 356 points, 356 targets 2026-02-21T12:04:50.5773437Z [1433s] Generation 4 starting: 68 neighbors, 4 active search path(s) 2026-02-21T12:05:27.3670864Z [1469s] Timeout after 30s compiling Config(block_sizes=[1, 64, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:05:27.3693343Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 0.5 configs/s 2026-02-21T12:06:08.6951612Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 70/70 1.4 configs/s 2026-02-21T12:06:09.3859292Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:06:15.5215444Z [1518s] Generation 4 complete: 2026-02-21T12:06:15.5215839Z error=2 2026-02-21T12:06:15.5216054Z timeout=1 2026-02-21T12:06:15.5216251Z ok=70 2026-02-21T12:06:15.5216456Z 
min=20.8176 2026-02-21T12:06:15.5216659Z mid=44.1705 2026-02-21T12:06:15.5216861Z max=379.6740 2026-02-21T12:06:15.5217148Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:06:15.5217563Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:06:15.5218469Z 'l2_groupings': [8], 2026-02-21T12:06:15.5218750Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:06:15.5219070Z 'loop_orders': [[0, 1]], 2026-02-21T12:06:15.5219340Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:06:15.5219773Z 'num_sm_multiplier': 128, 2026-02-21T12:06:15.5220046Z 'num_stages': 2, 2026-02-21T12:06:15.5220282Z 'num_warps': 8, 2026-02-21T12:06:15.5220540Z 'pid_type': 'persistent_blocked', 2026-02-21T12:06:15.5220860Z 'range_flattens': [False, None], 2026-02-21T12:06:15.5221185Z 'range_multi_buffers': [False, False], 2026-02-21T12:06:15.5221496Z 'range_num_stages': [3, 2], 2026-02-21T12:06:15.5221786Z 'range_unroll_factors': [4, 3], 2026-02-21T12:06:15.5222080Z 'range_warp_specializes': [], 2026-02-21T12:06:15.5222360Z 'waves_per_eu': 1} 2026-02-21T12:06:15.5237394Z [1518s] Fitting surrogate: 429 points, 429 targets 2026-02-21T12:06:16.2432993Z [1518s] Generation 5 starting: 71 neighbors, 4 active search path(s) 2026-02-21T12:06:52.4546048Z [1555s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:06:53.6054693Z [1556s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, 
num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:06:54.3550180Z [1556s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:06:55.9905227Z [1558s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:07:05.0747387Z [1567s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:07:05.0761583Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.3 configs/s 2026-02-21T12:07:39.9752118Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 73/73 1.9 configs/s 2026-02-21T12:07:40.5741964Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 
9/9 - configs/s 2026-02-21T12:07:45.8667701Z [1608s] Generation 5 complete: 2026-02-21T12:07:45.8668120Z error=5 2026-02-21T12:07:45.8668490Z timeout=5 2026-02-21T12:07:45.8668818Z ok=65 2026-02-21T12:07:45.8669125Z min=20.9904 2026-02-21T12:07:45.8669431Z mid=36.2941 2026-02-21T12:07:45.8669688Z max=326.3378 2026-02-21T12:07:45.8669970Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:07:45.8670410Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:07:45.8670810Z 'l2_groupings': [8], 2026-02-21T12:07:45.8671849Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:07:45.8672210Z 'loop_orders': [[0, 1]], 2026-02-21T12:07:45.8672489Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:07:45.8672778Z 'num_sm_multiplier': 128, 2026-02-21T12:07:45.8673055Z 'num_stages': 2, 2026-02-21T12:07:45.8673292Z 'num_warps': 8, 2026-02-21T12:07:45.8673568Z 'pid_type': 'persistent_blocked', 2026-02-21T12:07:45.8673878Z 'range_flattens': [False, True], 2026-02-21T12:07:45.8674156Z 'range_multi_buffers': [False, False], 2026-02-21T12:07:45.8674401Z 'range_num_stages': [3, 2], 2026-02-21T12:07:45.8674621Z 'range_unroll_factors': [4, 3], 2026-02-21T12:07:45.8674848Z 'range_warp_specializes': [], 2026-02-21T12:07:45.8675213Z 'waves_per_eu': 2} 2026-02-21T12:07:45.8693123Z [1608s] Fitting surrogate: 504 points, 504 targets 2026-02-21T12:07:47.6951487Z [1610s] Generation 6 starting: 75 neighbors, 4 active search path(s) 2026-02-21T12:08:22.0283281Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 0.8 configs/s 2026-02-21T12:09:17.2038236Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 77/77 1.0 configs/s 2026-02-21T12:09:18.0854051Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:09:25.8956464Z [1708s] Generation 6 complete: 2026-02-21T12:09:25.8956933Z error=2 2026-02-21T12:09:25.8957143Z ok=77 2026-02-21T12:09:25.8957347Z min=20.8192 2026-02-21T12:09:25.8957556Z mid=35.9649 
2026-02-21T12:09:25.8957771Z max=296.3518 2026-02-21T12:09:25.8958031Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:09:25.8958431Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:09:25.8960366Z 'l2_groupings': [8], 2026-02-21T12:09:25.8960728Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:09:25.8961057Z 'loop_orders': [[0, 1]], 2026-02-21T12:09:25.8961360Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:09:25.8961636Z 'num_sm_multiplier': 128, 2026-02-21T12:09:25.8961888Z 'num_stages': 2, 2026-02-21T12:09:25.8962099Z 'num_warps': 8, 2026-02-21T12:09:25.8962348Z 'pid_type': 'persistent_blocked', 2026-02-21T12:09:25.8962781Z 'range_flattens': [False, True], 2026-02-21T12:09:25.8963072Z 'range_multi_buffers': [False, False], 2026-02-21T12:09:25.8963356Z 'range_num_stages': [3, 2], 2026-02-21T12:09:25.8963631Z 'range_unroll_factors': [4, 3], 2026-02-21T12:09:25.8963913Z 'range_warp_specializes': [], 2026-02-21T12:09:25.8964168Z 'waves_per_eu': 2} 2026-02-21T12:09:25.8982723Z [1708s] Fitting surrogate: 583 points, 583 targets 2026-02-21T12:09:26.6863973Z [1709s] Generation 7 starting: 69 neighbors, 4 active search path(s) 2026-02-21T12:10:02.9214048Z [1745s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:10:04.1996603Z [1746s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, 
pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:10:04.2020917Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.5 configs/s 2026-02-21T12:10:36.6106238Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 71/71 2.2 configs/s 2026-02-21T12:10:37.1603571Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:10:42.0149661Z [1784s] Generation 7 complete: 2026-02-21T12:10:42.0149910Z error=9 2026-02-21T12:10:42.0150004Z timeout=2 2026-02-21T12:10:42.0150361Z ok=62 2026-02-21T12:10:42.0150452Z min=20.9846 2026-02-21T12:10:42.0150550Z mid=43.6457 2026-02-21T12:10:42.0150643Z max=215.3094 2026-02-21T12:10:42.0150755Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:10:42.0150952Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:10:42.0151147Z 'l2_groupings': [8], 2026-02-21T12:10:42.0151274Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:10:42.0151421Z 'loop_orders': [[0, 1]], 2026-02-21T12:10:42.0151549Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:10:42.0151672Z 'num_sm_multiplier': 128, 2026-02-21T12:10:42.0151789Z 'num_stages': 1, 2026-02-21T12:10:42.0151893Z 'num_warps': 8, 2026-02-21T12:10:42.0152130Z 'pid_type': 'persistent_blocked', 2026-02-21T12:10:42.0152268Z 'range_flattens': [False, True], 2026-02-21T12:10:42.0152420Z 'range_multi_buffers': [False, False], 2026-02-21T12:10:42.0152565Z 'range_num_stages': [3, 2], 2026-02-21T12:10:42.0152691Z 'range_unroll_factors': [4, 3], 2026-02-21T12:10:42.0152828Z 'range_warp_specializes': [], 2026-02-21T12:10:42.0152949Z 'waves_per_eu': 2} 2026-02-21T12:10:42.0178456Z [1784s] Fitting surrogate: 656 points, 656 targets 2026-02-21T12:10:42.8180057Z [1785s] Generation 8 starting: 75 neighbors, 4 active search path(s) 2026-02-21T12:11:18.5984667Z [1821s] Timeout after 30s 
compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, False], range_num_stages=[3, 3], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:11:22.7572129Z [1825s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:11:23.1353673Z [1825s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[4, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:11:24.5823146Z [1827s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:11:26.8732251Z [1829s] Timeout after 30s compiling 
Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:11:28.5244763Z [1831s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:11:30.4062016Z [1832s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:11:30.4089469Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.6 configs/s 2026-02-21T12:12:02.6255597Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 76/76 2.2 configs/s 2026-02-21T12:12:03.4341598Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:12:10.5909830Z [1873s] Generation 8 complete: 2026-02-21T12:12:10.5910259Z error=10 2026-02-21T12:12:10.5910480Z timeout=7 2026-02-21T12:12:10.5910683Z ok=62 2026-02-21T12:12:10.5910888Z min=20.9229 2026-02-21T12:12:10.5911096Z mid=28.2292 
2026-02-21T12:12:10.5911295Z max=546.9233 2026-02-21T12:12:10.5911542Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:12:10.5912019Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:12:10.5912518Z 'l2_groupings': [8], 2026-02-21T12:12:10.5912807Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:12:10.5913135Z 'loop_orders': [[0, 1]], 2026-02-21T12:12:10.5913410Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:12:10.5913695Z 'num_sm_multiplier': 128, 2026-02-21T12:12:10.5913981Z 'num_stages': 2, 2026-02-21T12:12:10.5914234Z 'num_warps': 8, 2026-02-21T12:12:10.5914495Z 'pid_type': 'persistent_blocked', 2026-02-21T12:12:10.5914820Z 'range_flattens': [False, True], 2026-02-21T12:12:10.5915137Z 'range_multi_buffers': [True, False], 2026-02-21T12:12:10.5915446Z 'range_num_stages': [3, 2], 2026-02-21T12:12:10.5915736Z 'range_unroll_factors': [4, 3], 2026-02-21T12:12:10.5916030Z 'range_warp_specializes': [], 2026-02-21T12:12:10.5916308Z 'waves_per_eu': 2} 2026-02-21T12:12:10.5989631Z [1873s] Fitting surrogate: 735 points, 735 targets 2026-02-21T12:12:11.0253443Z [1873s] Generation 9 starting: 36 neighbors, 2 active search path(s) 2026-02-21T12:12:47.1357879Z [1909s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[4, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:12:47.1379761Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.5 configs/s 2026-02-21T12:13:12.4859477Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 37/37 1.4 configs/s 2026-02-21T12:13:12.8917710Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 
2026-02-21T12:13:16.4606362Z [1939s] Generation 9 complete: 2026-02-21T12:13:16.4606794Z error=2 2026-02-21T12:13:16.4607043Z timeout=1 2026-02-21T12:13:16.4607252Z ok=35 2026-02-21T12:13:16.4607460Z min=20.9333 2026-02-21T12:13:16.4607663Z mid=31.0326 2026-02-21T12:13:16.4607865Z max=735.9344 2026-02-21T12:13:16.4608106Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:13:16.4608523Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:13:16.4608925Z 'l2_groupings': [8], 2026-02-21T12:13:16.4609222Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:13:16.4609536Z 'loop_orders': [[0, 1]], 2026-02-21T12:13:16.4609831Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:13:16.4610108Z 'num_sm_multiplier': 64, 2026-02-21T12:13:16.4610371Z 'num_stages': 2, 2026-02-21T12:13:16.4610604Z 'num_warps': 8, 2026-02-21T12:13:16.4611378Z 'pid_type': 'persistent_blocked', 2026-02-21T12:13:16.4611701Z 'range_flattens': [True, True], 2026-02-21T12:13:16.4612001Z 'range_multi_buffers': [True, False], 2026-02-21T12:13:16.4612312Z 'range_num_stages': [3, 2], 2026-02-21T12:13:16.4612596Z 'range_unroll_factors': [4, 3], 2026-02-21T12:13:16.4612870Z 'range_warp_specializes': [], 2026-02-21T12:13:16.4613100Z 'waves_per_eu': 2} 2026-02-21T12:13:16.4632983Z [1939s] Fitting surrogate: 773 points, 773 targets 2026-02-21T12:13:16.9488714Z [1939s] Generation 10 starting: 35 neighbors, 2 active search path(s) 2026-02-21T12:13:52.3739692Z [1974s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:13:52.3762754Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0.8 configs/s 
2026-02-21T12:14:04.9455579Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 2.7 configs/s 2026-02-21T12:14:05.1704646Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:14:07.1394471Z [1989s] Generation 10 complete: 2026-02-21T12:14:07.1396428Z error=13 2026-02-21T12:14:07.1396625Z timeout=1 2026-02-21T12:14:07.1396719Z ok=23 2026-02-21T12:14:07.1396822Z min=20.9181 2026-02-21T12:14:07.1396908Z mid=38.9961 2026-02-21T12:14:07.1396990Z max=241.0125 2026-02-21T12:14:07.1397117Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:14:07.1397278Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:14:07.1397444Z 'l2_groupings': [8], 2026-02-21T12:14:07.1397552Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:14:07.1397671Z 'loop_orders': [[0, 1]], 2026-02-21T12:14:07.1397777Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:14:07.1397891Z 'num_sm_multiplier': 32, 2026-02-21T12:14:07.1397990Z 'num_stages': 1, 2026-02-21T12:14:07.1398080Z 'num_warps': 8, 2026-02-21T12:14:07.1398175Z 'pid_type': 'persistent_blocked', 2026-02-21T12:14:07.1398295Z 'range_flattens': [True, True], 2026-02-21T12:14:07.1398668Z 'range_multi_buffers': [True, False], 2026-02-21T12:14:07.1398785Z 'range_num_stages': [3, 2], 2026-02-21T12:14:07.1398888Z 'range_unroll_factors': [4, 3], 2026-02-21T12:14:07.1399001Z 'range_warp_specializes': [], 2026-02-21T12:14:07.1399101Z 'waves_per_eu': 2} 2026-02-21T12:14:07.1411237Z [1989s] Fitting surrogate: 810 points, 810 targets 2026-02-21T12:14:07.6132016Z [1990s] Generation 11 starting: 35 neighbors, 2 active search path(s) 2026-02-21T12:14:39.4023916Z [2021s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, 
True], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:14:41.6968215Z [2024s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[4, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:14:43.8362494Z [2026s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:14:44.7578805Z [2027s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:14:44.7607117Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0.9 configs/s 2026-02-21T12:15:00.7957825Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 2.4 configs/s 2026-02-21T12:15:01.0920903Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:15:03.6902983Z [2046s] Generation 11 complete: 2026-02-21T12:15:03.6903442Z error=2 
2026-02-21T12:15:03.6903655Z timeout=4 2026-02-21T12:15:03.6903853Z ok=31 2026-02-21T12:15:03.6905327Z min=20.8761 2026-02-21T12:15:03.6905536Z mid=37.7577 2026-02-21T12:15:03.6905743Z max=209.6498 2026-02-21T12:15:03.6905995Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:15:03.6906448Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:15:03.6906836Z 'l2_groupings': [8], 2026-02-21T12:15:03.6907118Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:15:03.6907439Z 'loop_orders': [[0, 1]], 2026-02-21T12:15:03.6907710Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:15:03.6907994Z 'num_sm_multiplier': 32, 2026-02-21T12:15:03.6908263Z 'num_stages': 1, 2026-02-21T12:15:03.6908501Z 'num_warps': 8, 2026-02-21T12:15:03.6908755Z 'pid_type': 'persistent_blocked', 2026-02-21T12:15:03.6909573Z 'range_flattens': [True, True], 2026-02-21T12:15:03.6909876Z 'range_multi_buffers': [True, False], 2026-02-21T12:15:03.6910194Z 'range_num_stages': [3, 3], 2026-02-21T12:15:03.6910495Z 'range_unroll_factors': [4, 3], 2026-02-21T12:15:03.6910788Z 'range_warp_specializes': [], 2026-02-21T12:15:03.6910988Z 'waves_per_eu': 2} 2026-02-21T12:15:03.6932657Z [2046s] Fitting surrogate: 847 points, 847 targets 2026-02-21T12:15:04.1315872Z [2046s] Generation 12 starting: 35 neighbors, 2 active search path(s) 2026-02-21T12:15:36.7480115Z [2079s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, False], range_num_stages=[3, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:15:38.7938117Z [2081s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], 
load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:15:39.1888827Z [2081s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[3, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:15:39.7470904Z [2082s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:15:39.7487340Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 1.0 configs/s 2026-02-21T12:15:58.1339230Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 1.9 configs/s 2026-02-21T12:15:58.3960478Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:16:00.6832188Z [2103s] Generation 12 complete: 2026-02-21T12:16:00.6832447Z error=3 2026-02-21T12:16:00.6832933Z timeout=4 2026-02-21T12:16:00.6833063Z ok=30 2026-02-21T12:16:00.6833162Z min=20.9142 2026-02-21T12:16:00.6833279Z mid=36.8290 2026-02-21T12:16:00.6833380Z max=469.2883 2026-02-21T12:16:00.6833500Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:16:00.6833699Z 
'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:16:00.6833906Z 'l2_groupings': [8], 2026-02-21T12:16:00.6834045Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:16:00.6834200Z 'loop_orders': [[0, 1]], 2026-02-21T12:16:00.6834336Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:16:00.6834471Z 'num_sm_multiplier': 32, 2026-02-21T12:16:00.6834609Z 'num_stages': 2, 2026-02-21T12:16:00.6834718Z 'num_warps': 8, 2026-02-21T12:16:00.6834849Z 'pid_type': 'persistent_blocked', 2026-02-21T12:16:00.6834991Z 'range_flattens': [True, True], 2026-02-21T12:16:00.6835139Z 'range_multi_buffers': [True, False], 2026-02-21T12:16:00.6835292Z 'range_num_stages': [3, 3], 2026-02-21T12:16:00.6835428Z 'range_unroll_factors': [4, 3], 2026-02-21T12:16:00.6835574Z 'range_warp_specializes': [], 2026-02-21T12:16:00.6835702Z 'waves_per_eu': 2} 2026-02-21T12:16:00.6862605Z [2103s] Fitting surrogate: 884 points, 884 targets 2026-02-21T12:16:01.1579161Z [2103s] Generation 13 starting: 39 neighbors, 2 active search path(s) 2026-02-21T12:16:36.2046509Z [2138s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:16:36.7042188Z [2139s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[3, 3], range_unroll_factors=[4, 3], 
range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:16:39.4617238Z [2142s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:16:39.4639099Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 0.6 configs/s 2026-02-21T12:17:10.0427574Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 1.2 configs/s 2026-02-21T12:17:10.3244765Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:17:12.7938290Z [2175s] Generation 13 complete: 2026-02-21T12:17:12.7938790Z error=5 2026-02-21T12:17:12.7939014Z timeout=3 2026-02-21T12:17:12.7939219Z ok=33 2026-02-21T12:17:12.7939433Z min=20.9302 2026-02-21T12:17:12.7939637Z mid=45.8267 2026-02-21T12:17:12.7939859Z max=532.5824 2026-02-21T12:17:12.7940114Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:17:12.7940517Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:17:12.7940954Z 'l2_groupings': [8], 2026-02-21T12:17:12.7941235Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:17:12.7941583Z 'loop_orders': [[0, 1]], 2026-02-21T12:17:12.7941864Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:17:12.7942153Z 'num_sm_multiplier': 32, 2026-02-21T12:17:12.7942415Z 'num_stages': 2, 2026-02-21T12:17:12.7942646Z 'num_warps': 8, 2026-02-21T12:17:12.7942917Z 'pid_type': 'persistent_blocked', 2026-02-21T12:17:12.7943229Z 'range_flattens': [None, True], 2026-02-21T12:17:12.7943546Z 'range_multi_buffers': [True, False], 2026-02-21T12:17:12.7943855Z 'range_num_stages': [3, 3], 2026-02-21T12:17:12.7944151Z 'range_unroll_factors': [4, 3], 
2026-02-21T12:17:12.7944440Z 'range_warp_specializes': [], 2026-02-21T12:17:12.7944719Z 'waves_per_eu': 2} 2026-02-21T12:17:12.7968824Z [2175s] Fitting surrogate: 925 points, 925 targets 2026-02-21T12:17:14.3649088Z [2176s] Generation 14 starting: 32 neighbors, 2 active search path(s) 2026-02-21T12:17:50.1003982Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 0.7 configs/s 2026-02-21T12:18:24.2373133Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 0.7 configs/s 2026-02-21T12:18:24.4762061Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:18:26.5537660Z [2249s] Generation 14 complete: 2026-02-21T12:18:26.5538044Z error=3 2026-02-21T12:18:26.5538249Z ok=31 2026-02-21T12:18:26.5538450Z min=20.8891 2026-02-21T12:18:26.5538671Z mid=42.1927 2026-02-21T12:18:26.5538905Z max=690.1064 2026-02-21T12:18:26.5539155Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:18:26.5540003Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:18:26.5540394Z 'l2_groupings': [8], 2026-02-21T12:18:26.5540675Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:18:26.5541006Z 'loop_orders': [[0, 1]], 2026-02-21T12:18:26.5541282Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:18:26.5541560Z 'num_sm_multiplier': 32, 2026-02-21T12:18:26.5541819Z 'num_stages': 2, 2026-02-21T12:18:26.5542045Z 'num_warps': 8, 2026-02-21T12:18:26.5542304Z 'pid_type': 'persistent_blocked', 2026-02-21T12:18:26.5542624Z 'range_flattens': [False, None], 2026-02-21T12:18:26.5542933Z 'range_multi_buffers': [True, False], 2026-02-21T12:18:26.5543245Z 'range_num_stages': [3, 3], 2026-02-21T12:18:26.5543525Z 'range_unroll_factors': [4, 3], 2026-02-21T12:18:26.5572466Z 'range_warp_specializes': [], 2026-02-21T12:18:26.5572634Z 'waves_per_eu': 2} 2026-02-21T12:18:26.5572796Z [2249s] Fitting surrogate: 959 points, 959 targets 2026-02-21T12:18:27.0047181Z [2249s] Generation 15 starting: 33 neighbors, 2 active search path(s) 
2026-02-21T12:19:04.5151017Z [2287s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:19:04.5173676Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 0.7 configs/s 2026-02-21T12:19:21.7322711Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 2.0 configs/s 2026-02-21T12:19:22.0298053Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:19:24.6371414Z [2307s] Generation 15 complete: 2026-02-21T12:19:24.6372370Z error=6 2026-02-21T12:19:24.6372588Z timeout=1 2026-02-21T12:19:24.6372797Z ok=28 2026-02-21T12:19:24.6372998Z min=20.8988 2026-02-21T12:19:24.6373221Z mid=39.6252 2026-02-21T12:19:24.6373421Z max=224.5784 2026-02-21T12:19:24.6373665Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:19:24.6374091Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:19:24.6374483Z 'l2_groupings': [8], 2026-02-21T12:19:24.6374764Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:19:24.6375076Z 'loop_orders': [[0, 1]], 2026-02-21T12:19:24.6375356Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:19:24.6375649Z 'num_sm_multiplier': 32, 2026-02-21T12:19:24.6375904Z 'num_stages': 2, 2026-02-21T12:19:24.6376133Z 'num_warps': 8, 2026-02-21T12:19:24.6376387Z 'pid_type': 'persistent_blocked', 2026-02-21T12:19:24.6376692Z 'range_flattens': [False, False], 2026-02-21T12:19:24.6376994Z 'range_multi_buffers': [True, False], 2026-02-21T12:19:24.6377301Z 'range_num_stages': [3, 3], 2026-02-21T12:19:24.6377584Z 'range_unroll_factors': [4, 3], 2026-02-21T12:19:24.6377876Z 'range_warp_specializes': [], 
2026-02-21T12:19:24.6378156Z 'waves_per_eu': 2} 2026-02-21T12:19:24.6405771Z [2307s] Fitting surrogate: 994 points, 994 targets 2026-02-21T12:19:25.0667668Z [2307s] Generation 16 starting: 35 neighbors, 2 active search path(s) 2026-02-21T12:19:58.8084936Z [2341s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:20:00.9846234Z [2343s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:20:00.9868042Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 1.0 configs/s 2026-02-21T12:20:24.0674241Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 1.5 configs/s 2026-02-21T12:20:24.2946003Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:20:26.2798198Z [2368s] Generation 16 complete: 2026-02-21T12:20:26.2798551Z error=4 2026-02-21T12:20:26.2798759Z timeout=2 2026-02-21T12:20:26.2798963Z ok=31 2026-02-21T12:20:26.2799162Z min=20.9688 2026-02-21T12:20:26.2799373Z mid=37.6421 2026-02-21T12:20:26.2799584Z max=430.6540 2026-02-21T12:20:26.2799828Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:20:26.2800261Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:20:26.2800654Z 
'l2_groupings': [8], 2026-02-21T12:20:26.2800942Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:20:26.2801279Z 'loop_orders': [[0, 1]], 2026-02-21T12:20:26.2801560Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:20:26.2801845Z 'num_sm_multiplier': 32, 2026-02-21T12:20:26.2802110Z 'num_stages': 2, 2026-02-21T12:20:26.2802968Z 'num_warps': 8, 2026-02-21T12:20:26.2803253Z 'pid_type': 'persistent_blocked', 2026-02-21T12:20:26.2803567Z 'range_flattens': [False, False], 2026-02-21T12:20:26.2803872Z 'range_multi_buffers': [True, None], 2026-02-21T12:20:26.2804157Z 'range_num_stages': [3, 3], 2026-02-21T12:20:26.2804397Z 'range_unroll_factors': [4, 3], 2026-02-21T12:20:26.2804654Z 'range_warp_specializes': [], 2026-02-21T12:20:26.2804890Z 'waves_per_eu': 2} 2026-02-21T12:20:26.2833094Z [2368s] Fitting surrogate: 1031 points, 1031 targets 2026-02-21T12:20:26.7030342Z [2369s] Generation 17 starting: 35 neighbors, 2 active search path(s) 2026-02-21T12:21:00.5663761Z [2403s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:21:02.5056773Z [2405s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:21:03.9431023Z [2406s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 
32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:21:03.9456470Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0.7 configs/s 2026-02-21T12:21:25.1470984Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 1.7 configs/s 2026-02-21T12:21:25.4945458Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:21:28.5497840Z [2431s] Generation 17 complete: 2026-02-21T12:21:28.5501614Z error=2 2026-02-21T12:21:28.5501897Z timeout=3 2026-02-21T12:21:28.5502136Z ok=32 2026-02-21T12:21:28.5502343Z min=20.8908 2026-02-21T12:21:28.5502558Z mid=39.5303 2026-02-21T12:21:28.5502759Z max=401.4762 2026-02-21T12:21:28.5503061Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:21:28.5503487Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:21:28.5504308Z 'l2_groupings': [8], 2026-02-21T12:21:28.5504592Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:21:28.5504912Z 'loop_orders': [[0, 1]], 2026-02-21T12:21:28.5505215Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:21:28.5505496Z 'num_sm_multiplier': 32, 2026-02-21T12:21:28.5505763Z 'num_stages': 2, 2026-02-21T12:21:28.5505996Z 'num_warps': 8, 2026-02-21T12:21:28.5506264Z 'pid_type': 'persistent_blocked', 2026-02-21T12:21:28.5506619Z 'range_flattens': [None, False], 2026-02-21T12:21:28.5506914Z 'range_multi_buffers': [True, None], 2026-02-21T12:21:28.5507166Z 'range_num_stages': [3, 3], 2026-02-21T12:21:28.5507407Z 'range_unroll_factors': [4, 3], 2026-02-21T12:21:28.5507653Z 'range_warp_specializes': [], 2026-02-21T12:21:28.5507879Z 'waves_per_eu': 2} 2026-02-21T12:21:28.5538624Z [2431s] Fitting surrogate: 1068 
points, 1068 targets 2026-02-21T12:21:29.0342580Z [2431s] Generation 18 starting: 36 neighbors, 2 active search path(s) 2026-02-21T12:22:01.1201730Z [2463s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:22:02.7427700Z [2465s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:22:03.8355569Z [2466s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:22:03.8378664Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.9 configs/s 2026-02-21T12:22:30.8763307Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 1.3 configs/s 2026-02-21T12:22:31.1170837Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:22:33.2107101Z [2495s] Generation 18 complete: 
2026-02-21T12:22:33.2107482Z error=8 2026-02-21T12:22:33.2107689Z timeout=3 2026-02-21T12:22:33.2107921Z ok=27 2026-02-21T12:22:33.2108130Z min=20.9218 2026-02-21T12:22:33.2108335Z mid=44.1776 2026-02-21T12:22:33.2108568Z max=839.8673 2026-02-21T12:22:33.2108814Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:22:33.2109265Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:22:33.2109666Z 'l2_groupings': [8], 2026-02-21T12:22:33.2109950Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:22:33.2110267Z 'loop_orders': [[0, 1]], 2026-02-21T12:22:33.2110550Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:22:33.2110825Z 'num_sm_multiplier': 32, 2026-02-21T12:22:33.2111439Z 'num_stages': 3, 2026-02-21T12:22:33.2111661Z 'num_warps': 8, 2026-02-21T12:22:33.2111921Z 'pid_type': 'persistent_blocked', 2026-02-21T12:22:33.2112240Z 'range_flattens': [None, False], 2026-02-21T12:22:33.2112537Z 'range_multi_buffers': [True, None], 2026-02-21T12:22:33.2112847Z 'range_num_stages': [4, 3], 2026-02-21T12:22:33.2113133Z 'range_unroll_factors': [4, 3], 2026-02-21T12:22:33.2113438Z 'range_warp_specializes': [], 2026-02-21T12:22:33.2113724Z 'waves_per_eu': 2} 2026-02-21T12:22:33.2142528Z [2495s] Fitting surrogate: 1106 points, 1106 targets 2026-02-21T12:22:33.6610090Z [2496s] Generation 19 starting: 34 neighbors, 2 active search path(s) 2026-02-21T12:23:07.4084416Z [2529s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:23:08.3940116Z [2530s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 
'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[4, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:23:09.2113004Z [2531s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:23:09.2137638Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 0.9 configs/s 2026-02-21T12:23:23.9905397Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 2.3 configs/s 2026-02-21T12:23:24.2821016Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:23:26.8389744Z [2549s] Generation 19 complete: 2026-02-21T12:23:26.8390248Z error=5 2026-02-21T12:23:26.8390463Z timeout=3 2026-02-21T12:23:26.8390668Z ok=28 2026-02-21T12:23:26.8390866Z min=20.8872 2026-02-21T12:23:26.8391081Z mid=37.1607 2026-02-21T12:23:26.8391280Z max=191.2476 2026-02-21T12:23:26.8391542Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:23:26.8391947Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:23:26.8392318Z 'l2_groupings': [8], 2026-02-21T12:23:26.8392580Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:23:26.8392893Z 'loop_orders': [[0, 1]], 2026-02-21T12:23:26.8393151Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:23:26.8393416Z 'num_sm_multiplier': 32, 2026-02-21T12:23:26.8393664Z 'num_stages': 3, 2026-02-21T12:23:26.8393877Z 'num_warps': 8, 
2026-02-21T12:23:26.8394122Z 'pid_type': 'persistent_blocked', 2026-02-21T12:23:26.8394413Z 'range_flattens': [None, False], 2026-02-21T12:23:26.8394706Z 'range_multi_buffers': [True, None], 2026-02-21T12:23:26.8394994Z 'range_num_stages': [3, 3], 2026-02-21T12:23:26.8395256Z 'range_unroll_factors': [3, 3], 2026-02-21T12:23:26.8396167Z 'range_warp_specializes': [], 2026-02-21T12:23:26.8396425Z 'waves_per_eu': 2} 2026-02-21T12:23:26.8427409Z [2549s] Fitting surrogate: 1142 points, 1142 targets 2026-02-21T12:23:27.2989949Z [2549s] Generation 20 starting: 36 neighbors, 2 active search path(s) 2026-02-21T12:24:00.2943138Z [2582s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:24:04.3360787Z [2586s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:24:05.8988677Z [2588s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], 
range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:24:05.9003724Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.5 configs/s 2026-02-21T12:24:26.3858938Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 1.8 configs/s 2026-02-21T12:24:26.7931238Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T12:24:30.4027808Z [2612s] Generation 20 complete: 2026-02-21T12:24:30.4028274Z error=5 2026-02-21T12:24:30.4028498Z timeout=3 2026-02-21T12:24:30.4028695Z ok=30 2026-02-21T12:24:30.4028931Z min=20.9518 2026-02-21T12:24:30.4029135Z mid=22.1884 2026-02-21T12:24:30.4029443Z max=433.0034 2026-02-21T12:24:30.4029807Z best={'block_sizes': [1, 256, 64], 2026-02-21T12:24:30.4030433Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T12:24:30.4031005Z 'l2_groupings': [8], 2026-02-21T12:24:30.4031288Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:24:30.4031616Z 'loop_orders': [[0, 1]], 2026-02-21T12:24:30.4032038Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:24:30.4032492Z 'num_sm_multiplier': 32, 2026-02-21T12:24:30.4032919Z 'num_stages': 3, 2026-02-21T12:24:30.4033267Z 'num_warps': 8, 2026-02-21T12:24:30.4033582Z 'pid_type': 'persistent_blocked', 2026-02-21T12:24:30.4033897Z 'range_flattens': [False, False], 2026-02-21T12:24:30.4034207Z 'range_multi_buffers': [True, None], 2026-02-21T12:24:30.4034441Z 'range_num_stages': [3, 3], 2026-02-21T12:24:30.4034648Z 'range_unroll_factors': [3, 3], 2026-02-21T12:24:30.4034860Z 'range_warp_specializes': [], 2026-02-21T12:24:30.4035061Z 'waves_per_eu': 2} 2026-02-21T12:24:30.4066578Z [2612s] Fitting surrogate: 1180 points, 1180 targets 2026-02-21T12:24:30.5462549Z [2613s] Autotuning complete in 2613.1s after searching 1074 configs. 
2026-02-21T12:24:30.5462791Z One can hardcode the best config and skip autotuning with: 2026-02-21T12:24:30.5463593Z @helion.kernel(config=helion.Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T12:24:30.5464712Z 2026-02-21T12:24:30.5464887Z [2613s] Code of selected kernel: /tmp/torchinductor_root/dw/cdwlh63qytjqmhnny24trv2tdjay2tlbktwxnqcdldr37fjxfebq.py 2026-02-21T12:24:31.6746970Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T12:24:31.6750244Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T12:24:31.6754782Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T12:24:31.6755816Z WARNING:tritonbench.utils.triton_op:Completed input ID 6: 2026-02-21T12:24:31.6756269Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T12:24:31.6756619Z ------------------------------------------ 2026-02-21T12:24:31.6757064Z (4, 48, 8192, 8192, 128) 2026-02-21T12:24:31.6757240Z 2026-02-21T12:24:31.6757802Z 100%|██████████| 6/6 [1:40:05<00:00, 1492.33s/it] 2026-02-21T12:24:31.6758232Z 100%|██████████| 6/6 [1:40:05<00:00, 1000.88s/it] 2026-02-21T12:24:31.6758675Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpli665d5a.csv 2026-02-21T12:24:34.9230162Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) flex_attention-speedup flex_attention-accuracy helion_attention-speedup helion_attention-accuracy 2026-02-21T12:24:34.9231373Z ------------------------------------------ ------------------------ ------------------------- -------------------------- --------------------------- 2026-02-21T12:24:34.9233081Z (4, 48, 128, 128, 128) 2.027 0 3.37355 0 2026-02-21T12:24:34.9233786Z (4, 48, 256, 256, 128) 2.18314 0 3.25082 0 2026-02-21T12:24:34.9234363Z (4, 48, 512, 512, 128) 2.73606 1 3.31177 0 2026-02-21T12:24:34.9234955Z (4, 48, 2048, 2048, 128) 3.51616 1 5.07911 0 2026-02-21T12:24:34.9235550Z (4, 48, 4096, 4096, 128) 3.65348 1 4.96449 0 2026-02-21T12:24:34.9236140Z (4, 48, 8192, 8192, 128) 3.89044 1 5.52917 0 2026-02-21T12:24:34.9236737Z average 3.00105 0.666667 4.25149 0 2026-02-21T12:26:53.4609172Z Applying custom args for flash_attention: {'d_head': 128, 'num_inputs': 6} 2026-02-21T12:26:53.4741760Z INFO:root:TMA benchmarks will be running without grid constant TMA descriptor. 2026-02-21T12:26:53.4763150Z TMA benchmarks will be running without grid constant TMA descriptor. 2026-02-21T12:26:53.4875354Z Running flash_attention benchmark with Helion implementation... 
2026-02-21T12:26:53.4875565Z 2026-02-21T12:26:53.6321606Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 7) 2026-02-21T12:26:53.6322306Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 4, 5, 6] 2026-02-21T12:26:53.6328527Z 2026-02-21T12:26:53.6334149Z 0%| | 0/6 [00:00>) : (tensor<1x128x16xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x16xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>> 2026-02-21T12:32:02.9948624Z 2026-02-21T12:32:02.9949675Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T12:32:02.9950956Z ^ 2026-02-21T12:32:02.9951473Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T12:32:02.9952231Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}> 2026-02-21T12:32:02.9953030Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:02.9953767Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}> 2026-02-21T12:32:02.9954351Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T12:32:02.9954949Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}> 2026-02-21T12:32:02.9955536Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:02.9956122Z #blocked6 = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:02.9956717Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:02.9957298Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T12:32:02.9957864Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T12:32:02.9958411Z #blocked10 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}> 2026-02-21T12:32:02.9958957Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}> 2026-02-21T12:32:02.9959551Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:02.9960139Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T12:32:02.9960716Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}> 2026-02-21T12:32:02.9961309Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}> 2026-02-21T12:32:02.9961900Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:02.9962707Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:02.9963349Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T12:32:02.9964112Z tt.func public 
@_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T12:32:02.9964699Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T12:32:02.9964877Z %c192_i64 = arith.constant 192 : i64 2026-02-21T12:32:02.9965042Z %c0_i64 = arith.constant 0 : i64 2026-02-21T12:32:02.9965213Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T12:32:02.9965439Z %cst = arith.constant dense<0.000000e+00> : tensor<1x128x16xbf16, #blocked> 2026-02-21T12:32:02.9965723Z %cst_0 = arith.constant dense<256> : tensor<1x1x16xi64, #blocked1> 2026-02-21T12:32:02.9965970Z %cst_1 = arith.constant dense<0> : tensor<1x1x16xi64, #blocked1> 2026-02-21T12:32:02.9966249Z %cst_2 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:32:02.9966502Z %cst_3 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:32:02.9966744Z %cst_4 = arith.constant dense<128> : tensor<1x1x16xi64, #blocked1> 2026-02-21T12:32:02.9967012Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked3> 2026-02-21T12:32:02.9967280Z %cst_6 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked4> 2026-02-21T12:32:02.9967527Z %cst_7 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked4> 2026-02-21T12:32:02.9967792Z %cst_8 = arith.constant dense<256> : tensor<1x2x1xi64, #blocked5> 2026-02-21T12:32:02.9968039Z %cst_9 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked5> 2026-02-21T12:32:02.9968285Z %cst_10 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked5> 2026-02-21T12:32:02.9968491Z %c16_i32 = arith.constant 16 : i32 2026-02-21T12:32:02.9968658Z %c256_i32 = arith.constant 256 : i32 2026-02-21T12:32:02.9968859Z %cst_11 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked5> 2026-02-21T12:32:02.9969108Z %cst_12 = arith.constant dense<128> : tensor<1x16x1xi32, #blocked6> 
2026-02-21T12:32:02.9969367Z %cst_13 = arith.constant dense<0.127517432> : tensor<1x2x16xf32, #blocked7> 2026-02-21T12:32:02.9969646Z %cst_14 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:02.9969925Z %cst_15 = arith.constant dense<0.000000e+00> : tensor<2x16xf32, #blocked9> 2026-02-21T12:32:02.9970147Z %c0_i32 = arith.constant 0 : i32 2026-02-21T12:32:02.9970366Z %cst_16 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked3> 2026-02-21T12:32:02.9970639Z %cst_17 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:02.9970911Z %cst_18 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:02.9971124Z %c2_i32 = arith.constant 2 : i32 2026-02-21T12:32:02.9971290Z %0 = tt.get_program_id x : i32 2026-02-21T12:32:02.9971448Z %1 = tt.get_program_id y : i32 2026-02-21T12:32:02.9971600Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T12:32:02.9971821Z %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked10> 2026-02-21T12:32:02.9972071Z %4 = tt.splat %2 : i32 -> tensor<2xi32, #blocked10> 2026-02-21T12:32:02.9972273Z %5 = arith.addi %4, %3 : tensor<2xi32, #blocked10> 2026-02-21T12:32:02.9972510Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked10> 2026-02-21T12:32:02.9972744Z %7 = arith.extsi %0 : i32 to i64 2026-02-21T12:32:02.9972902Z %8 = arith.extsi %2 : i32 to i64 2026-02-21T12:32:02.9973115Z %9 = tt.splat %arg0 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked3> 2026-02-21T12:32:02.9973373Z %10 = arith.muli %7, %c32768_i64 : i64 2026-02-21T12:32:02.9973565Z %11 = tt.splat %10 : i64 -> tensor<1x2x128xi64, #blocked3> 2026-02-21T12:32:02.9973784Z %12 = tt.splat %8 : i64 -> tensor<2xi64, #blocked10> 2026-02-21T12:32:02.9974019Z %13 = arith.extsi %3 : tensor<2xi32, #blocked10> to tensor<2xi64, #blocked10> 2026-02-21T12:32:02.9974269Z %14 = arith.addi %12, %13 : tensor<2xi64, #blocked10> 2026-02-21T12:32:02.9974555Z %15 = 
ttg.convert_layout %14 : tensor<2xi64, #blocked10> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> 2026-02-21T12:32:02.9974910Z %16 = tt.expand_dims %15 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x2xi64, #blocked11> 2026-02-21T12:32:02.9975220Z %17 = ttg.convert_layout %16 : tensor<1x2xi64, #blocked11> -> tensor<1x2xi64, #blocked8> 2026-02-21T12:32:02.9975522Z %18 = ttg.convert_layout %17 : tensor<1x2xi64, #blocked8> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T12:32:02.9975885Z %19 = tt.expand_dims %18 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xi64, #blocked12> 2026-02-21T12:32:02.9976226Z %20 = ttg.convert_layout %19 : tensor<1x2x1xi64, #blocked12> -> tensor<1x2x1xi64, #blocked5> 2026-02-21T12:32:02.9976452Z %21 = arith.muli %20, %cst_10 : tensor<1x2x1xi64, #blocked5> 2026-02-21T12:32:02.9976667Z %22 = tt.broadcast %21 : tensor<1x2x1xi64, #blocked5> -> tensor<1x2x128xi64, #blocked5> 2026-02-21T12:32:02.9976929Z %23 = ttg.convert_layout %22 : tensor<1x2x128xi64, #blocked5> -> tensor<1x2x128xi64, #blocked3> 2026-02-21T12:32:02.9977188Z %24 = arith.extsi %6 : tensor<128xi32, #blocked10> to tensor<128xi64, #blocked10> 2026-02-21T12:32:02.9977503Z %25 = ttg.convert_layout %24 : tensor<128xi64, #blocked10> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> 2026-02-21T12:32:02.9977859Z %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x128xi64, #blocked11> 2026-02-21T12:32:02.9978180Z %27 = ttg.convert_layout %26 : tensor<1x128xi64, #blocked11> -> tensor<1x128xi64, #blocked13> 2026-02-21T12:32:02.9978495Z %28 = ttg.convert_layout %27 : tensor<1x128xi64, #blocked13> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked14}>> 2026-02-21T12:32:02.9978872Z %29 = tt.expand_dims %28 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = 
#blocked14}>> -> tensor<1x1x128xi64, #blocked14> 2026-02-21T12:32:02.9979204Z %30 = ttg.convert_layout %29 : tensor<1x1x128xi64, #blocked14> -> tensor<1x1x128xi64, #blocked4> 2026-02-21T12:32:02.9979469Z %31 = tt.broadcast %30 : tensor<1x1x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4> 2026-02-21T12:32:02.9979734Z %32 = ttg.convert_layout %31 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked3> 2026-02-21T12:32:02.9979960Z %33 = arith.addi %23, %32 : tensor<1x2x128xi64, #blocked3> 2026-02-21T12:32:02.9980135Z %34 = arith.addi %11, %33 : tensor<1x2x128xi64, #blocked3> 2026-02-21T12:32:02.9980364Z %35 = tt.addptr %9, %34 : tensor<1x2x128x!tt.ptr, #blocked3>, tensor<1x2x128xi64, #blocked3> 2026-02-21T12:32:02.9980574Z %36 = arith.cmpi sge, %7, %c0_i64 : i64 2026-02-21T12:32:02.9980714Z %37 = arith.cmpi slt, %7, %c192_i64 : i64 2026-02-21T12:32:02.9980846Z %38 = arith.andi %36, %37 : i1 2026-02-21T12:32:02.9981003Z %39 = arith.cmpi sge, %20, %cst_9 : tensor<1x2x1xi64, #blocked5> 2026-02-21T12:32:02.9981190Z %40 = arith.cmpi slt, %20, %cst_8 : tensor<1x2x1xi64, #blocked5> 2026-02-21T12:32:02.9981370Z %41 = arith.andi %39, %40 : tensor<1x2x1xi1, #blocked5> 2026-02-21T12:32:02.9981540Z %42 = tt.splat %38 : i1 -> tensor<1x2x1xi1, #blocked5> 2026-02-21T12:32:02.9981702Z %43 = arith.andi %42, %41 : tensor<1x2x1xi1, #blocked5> 2026-02-21T12:32:02.9981907Z %44 = tt.broadcast %43 : tensor<1x2x1xi1, #blocked5> -> tensor<1x2x128xi1, #blocked5> 2026-02-21T12:32:02.9982205Z %45 = ttg.convert_layout %44 : tensor<1x2x128xi1, #blocked5> -> tensor<1x2x128xi1, #blocked3> 2026-02-21T12:32:02.9982440Z %46 = arith.cmpi sge, %30, %cst_7 : tensor<1x1x128xi64, #blocked4> 2026-02-21T12:32:02.9982633Z %47 = arith.cmpi slt, %30, %cst_6 : tensor<1x1x128xi64, #blocked4> 2026-02-21T12:32:02.9982811Z %48 = arith.andi %46, %47 : tensor<1x1x128xi1, #blocked4> 2026-02-21T12:32:02.9983023Z %49 = tt.broadcast %48 : tensor<1x1x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4> 
2026-02-21T12:32:02.9983280Z %50 = ttg.convert_layout %49 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked3> 2026-02-21T12:32:02.9983487Z %51 = arith.andi %45, %50 : tensor<1x2x128xi1, #blocked3> 2026-02-21T12:32:02.9983654Z %52 = tt.load %35, %51, %cst_5 : tensor<1x2x128x!tt.ptr, #blocked3> 2026-02-21T12:32:02.9983855Z %53 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked10> 2026-02-21T12:32:02.9984072Z %54 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x16x!tt.ptr, #blocked> 2026-02-21T12:32:02.9984260Z %55 = tt.splat %10 : i64 -> tensor<1x128x16xi64, #blocked> 2026-02-21T12:32:02.9984527Z %56 = ttg.convert_layout %27 : tensor<1x128xi64, #blocked13> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked15}>> 2026-02-21T12:32:02.9984870Z %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x128x1xi64, #blocked15> 2026-02-21T12:32:02.9985179Z %58 = ttg.convert_layout %57 : tensor<1x128x1xi64, #blocked15> -> tensor<1x128x1xi64, #blocked2> 2026-02-21T12:32:02.9985430Z %59 = tt.broadcast %58 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x16xi64, #blocked2> 2026-02-21T12:32:02.9985674Z %60 = ttg.convert_layout %59 : tensor<1x128x16xi64, #blocked2> -> tensor<1x128x16xi64, #blocked> 2026-02-21T12:32:02.9985925Z %61 = arith.extsi %53 : tensor<16xi32, #blocked10> to tensor<16xi64, #blocked10> 2026-02-21T12:32:02.9986120Z %62 = arith.cmpi sge, %58, %cst_3 : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:32:02.9986300Z %63 = arith.cmpi slt, %58, %cst_2 : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:32:02.9986472Z %64 = arith.andi %62, %63 : tensor<1x128x1xi1, #blocked2> 2026-02-21T12:32:02.9986626Z %65 = tt.splat %38 : i1 -> tensor<1x128x1xi1, #blocked2> 2026-02-21T12:32:02.9986784Z %66 = arith.andi %65, %64 : tensor<1x128x1xi1, #blocked2> 2026-02-21T12:32:02.9986973Z %67 = tt.broadcast %66 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x16xi1, #blocked2> 
2026-02-21T12:32:02.9987220Z %68 = ttg.convert_layout %67 : tensor<1x128x16xi1, #blocked2> -> tensor<1x128x16xi1, #blocked> 2026-02-21T12:32:02.9987463Z %69 = tt.reshape %52 : tensor<1x2x128xbf16, #blocked3> -> tensor<2x128xbf16, #blocked13> 2026-02-21T12:32:02.9987709Z %70 = arith.muli %0, %c32768_i32 : i32 2026-02-21T12:32:02.9987859Z %71 = tt.splat %70 : i32 -> tensor<1x16x1xi32, #blocked6> 2026-02-21T12:32:02.9988097Z %72 = ttg.convert_layout %6 : tensor<128xi32, #blocked10> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> 2026-02-21T12:32:02.9988435Z %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x128xi32, #blocked11> 2026-02-21T12:32:02.9988734Z %74 = ttg.convert_layout %73 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #blocked13> 2026-02-21T12:32:02.9989024Z %75 = ttg.convert_layout %74 : tensor<1x128xi32, #blocked13> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> 2026-02-21T12:32:02.9989371Z %76 = tt.expand_dims %75 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x128xi32, #blocked14> 2026-02-21T12:32:02.9989674Z %77 = ttg.convert_layout %76 : tensor<1x1x128xi32, #blocked14> -> tensor<1x1x128xi32, #blocked4> 2026-02-21T12:32:02.9989928Z %78 = tt.broadcast %77 : tensor<1x1x128xi32, #blocked4> -> tensor<1x16x128xi32, #blocked4> 2026-02-21T12:32:02.9998448Z %79 = ttg.convert_layout %78 : tensor<1x16x128xi32, #blocked4> -> tensor<1x16x128xi32, #blocked3> 2026-02-21T12:32:02.9998745Z %80 = tt.splat %arg2 : !tt.ptr -> tensor<1x16x128x!tt.ptr, #blocked3> 2026-02-21T12:32:02.9999148Z %81:3 = scf.for %arg4 = %c0_i32 to %c256_i32 step %c16_i32 iter_args(%arg5 = %cst_18, %arg6 = %cst_17, %arg7 = %cst_16) -> (tensor<1x2xf32, #blocked8>, tensor<1x2xf32, #blocked8>, tensor<1x2x128xf32, #blocked3>) : i32 { 2026-02-21T12:32:02.9999515Z %112 = tt.splat %arg4 : i32 -> tensor<16xi32, #blocked10> 2026-02-21T12:32:02.9999705Z 
%113 = arith.addi %112, %53 : tensor<16xi32, #blocked10> 2026-02-21T12:32:02.9999857Z %114 = arith.extsi %arg4 : i32 to i64 2026-02-21T12:32:02.9999998Z %115 = tt.splat %114 : i64 -> tensor<16xi64, #blocked10> 2026-02-21T12:32:03.0000159Z %116 = arith.addi %115, %61 : tensor<16xi64, #blocked10> 2026-02-21T12:32:03.0000410Z %117 = ttg.convert_layout %116 : tensor<16xi64, #blocked10> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> 2026-02-21T12:32:03.0000762Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x16xi64, #blocked11> 2026-02-21T12:32:03.0001087Z %119 = ttg.convert_layout %118 : tensor<1x16xi64, #blocked11> -> tensor<1x16xi64, #blocked9> 2026-02-21T12:32:03.0001382Z %120 = ttg.convert_layout %119 : tensor<1x16xi64, #blocked9> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked16}>> 2026-02-21T12:32:03.0001738Z %121 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked16}>> -> tensor<1x1x16xi64, #blocked16> 2026-02-21T12:32:03.0002049Z %122 = ttg.convert_layout %121 : tensor<1x1x16xi64, #blocked16> -> tensor<1x1x16xi64, #blocked1> 2026-02-21T12:32:03.0002271Z %123 = arith.muli %122, %cst_4 : tensor<1x1x16xi64, #blocked1> 2026-02-21T12:32:03.0002505Z %124 = tt.broadcast %123 : tensor<1x1x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked1> 2026-02-21T12:32:03.0002813Z %125 = ttg.convert_layout %124 : tensor<1x128x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked> 2026-02-21T12:32:03.0003036Z %126 = arith.addi %60, %125 : tensor<1x128x16xi64, #blocked> 2026-02-21T12:32:03.0003202Z %127 = arith.addi %55, %126 : tensor<1x128x16xi64, #blocked> 2026-02-21T12:32:03.0003423Z %128 = tt.addptr %54, %127 : tensor<1x128x16x!tt.ptr, #blocked>, tensor<1x128x16xi64, #blocked> 2026-02-21T12:32:03.0003681Z %129 = arith.cmpi sge, %122, %cst_1 : tensor<1x1x16xi64, #blocked1> 2026-02-21T12:32:03.0003869Z %130 = arith.cmpi slt, %122, %cst_0 : 
tensor<1x1x16xi64, #blocked1> 2026-02-21T12:32:03.0004045Z %131 = arith.andi %129, %130 : tensor<1x1x16xi1, #blocked1> 2026-02-21T12:32:03.0004253Z %132 = tt.broadcast %131 : tensor<1x1x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked1> 2026-02-21T12:32:03.0004508Z %133 = ttg.convert_layout %132 : tensor<1x128x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked> 2026-02-21T12:32:03.0004722Z %134 = arith.andi %68, %133 : tensor<1x128x16xi1, #blocked> 2026-02-21T12:32:03.0004903Z %135 = tt.load %128, %134, %cst : tensor<1x128x16x!tt.ptr, #blocked> 2026-02-21T12:32:03.0005125Z %136 = tt.reshape %135 : tensor<1x128x16xbf16, #blocked> -> tensor<128x16xbf16, #blocked9> 2026-02-21T12:32:03.0005433Z %137 = ttg.convert_layout %69 : tensor<2x128xbf16, #blocked13> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> 2026-02-21T12:32:03.0005794Z %138 = ttg.convert_layout %136 : tensor<128x16xbf16, #blocked9> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> 2026-02-21T12:32:03.0006104Z %139 = ttg.convert_layout %cst_15 : tensor<2x16xf32, #blocked9> -> tensor<2x16xf32, #blocked9> 2026-02-21T12:32:03.0006526Z %140 = tt.dot %137, %138, %139, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> -> tensor<2x16xf32, #blocked9> 2026-02-21T12:32:03.0006950Z %141 = tt.reshape %140 : tensor<2x16xf32, #blocked9> -> tensor<1x2x16xf32, #blocked7> 2026-02-21T12:32:03.0007197Z %142 = arith.truncf %141 : tensor<1x2x16xf32, #blocked7> to tensor<1x2x16xbf16, #blocked7> 2026-02-21T12:32:03.0007440Z %143 = arith.extf %142 : tensor<1x2x16xbf16, #blocked7> to tensor<1x2x16xf32, #blocked7> 2026-02-21T12:32:03.0007631Z %144 = "tt.reduce"(%143) <{axis = 2 : i32}> ({ 2026-02-21T12:32:03.0007787Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T12:32:03.0007910Z %199 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T12:32:03.0008039Z tt.reduce.return %199 : f32 
2026-02-21T12:32:03.0008228Z }) : (tensor<1x2x16xf32, #blocked7>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:32:03.0008527Z %145 = ttg.convert_layout %144 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0008807Z %146 = arith.truncf %145 : tensor<1x2xf32, #blocked8> to tensor<1x2xbf16, #blocked8> 2026-02-21T12:32:03.0009034Z %147 = arith.extf %146 : tensor<1x2xbf16, #blocked8> to tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0009251Z %148 = arith.mulf %147, %cst_14 : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0009445Z %149 = arith.truncf %148 : tensor<1x2xf32, #blocked8> to tensor<1x2xbf16, #blocked8> 2026-02-21T12:32:03.0009671Z %150 = arith.extf %149 : tensor<1x2xbf16, #blocked8> to tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0009877Z %151 = arith.cmpf ogt, %arg5, %150 : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0010051Z %152 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0010222Z %153 = arith.ori %151, %152 : tensor<1x2xi1, #blocked8> 2026-02-21T12:32:03.0010438Z %154 = arith.select %153, %arg5, %150 : tensor<1x2xi1, #blocked8>, tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0010653Z %155 = arith.mulf %143, %cst_13 : tensor<1x2x16xf32, #blocked7> 2026-02-21T12:32:03.0010863Z %156 = arith.truncf %155 : tensor<1x2x16xf32, #blocked7> to tensor<1x2x16xbf16, #blocked7> 2026-02-21T12:32:03.0011157Z %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked8> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T12:32:03.0011504Z %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xf32, #blocked12> 2026-02-21T12:32:03.0011810Z %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked12> -> tensor<1x2x1xf32, #blocked5> 2026-02-21T12:32:03.0012060Z %160 = arith.extf %156 : tensor<1x2x16xbf16, #blocked7> to tensor<1x2x16xf32, #blocked7> 
2026-02-21T12:32:03.0012300Z %161 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked5> -> tensor<1x2x16xf32, #blocked5> 2026-02-21T12:32:03.0012546Z %162 = ttg.convert_layout %161 : tensor<1x2x16xf32, #blocked5> -> tensor<1x2x16xf32, #blocked7> 2026-02-21T12:32:03.0012762Z %163 = arith.subf %160, %162 : tensor<1x2x16xf32, #blocked7> 2026-02-21T12:32:03.0013069Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x16xf32, #blocked7>) -> tensor<1x2x16xf32, #blocked7> 2026-02-21T12:32:03.0013363Z %165 = "tt.reduce"(%164) <{axis = 2 : i32}> ({ 2026-02-21T12:32:03.0013497Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T12:32:03.0013617Z %199 = arith.addf %arg8, %arg9 : f32 2026-02-21T12:32:03.0013743Z tt.reduce.return %199 : f32 2026-02-21T12:32:03.0013929Z }) : (tensor<1x2x16xf32, #blocked7>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:32:03.0014224Z %166 = ttg.convert_layout %165 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0014472Z %167 = arith.subf %arg5, %154 : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0014766Z %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked8>) -> tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0015077Z %169 = arith.mulf %arg6, %168 : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0015240Z %170 = arith.addf %169, %166 : tensor<1x2xf32, #blocked8> 2026-02-21T12:32:03.0015492Z %171 = ttg.convert_layout %168 : tensor<1x2xf32, #blocked8> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T12:32:03.0015836Z %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xf32, #blocked12> 2026-02-21T12:32:03.0016157Z %173 = ttg.convert_layout %172 : tensor<1x2x1xf32, #blocked12> -> tensor<1x2x1xf32, #blocked5> 2026-02-21T12:32:03.0016410Z 
%174 = tt.broadcast %173 : tensor<1x2x1xf32, #blocked5> -> tensor<1x2x128xf32, #blocked5> 2026-02-21T12:32:03.0016661Z %175 = ttg.convert_layout %174 : tensor<1x2x128xf32, #blocked5> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T12:32:03.0016883Z %176 = arith.mulf %arg7, %175 : tensor<1x2x128xf32, #blocked3> 2026-02-21T12:32:03.0017145Z %177 = ttg.convert_layout %113 : tensor<16xi32, #blocked10> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> 2026-02-21T12:32:03.0017483Z %178 = tt.expand_dims %177 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x16xi32, #blocked11> 2026-02-21T12:32:03.0017782Z %179 = ttg.convert_layout %178 : tensor<1x16xi32, #blocked11> -> tensor<1x16xi32, #blocked9> 2026-02-21T12:32:03.0018074Z %180 = ttg.convert_layout %179 : tensor<1x16xi32, #blocked9> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> 2026-02-21T12:32:03.0018424Z %181 = tt.expand_dims %180 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x16x1xi32, #blocked17> 2026-02-21T12:32:03.0018752Z %182 = ttg.convert_layout %181 : tensor<1x16x1xi32, #blocked17> -> tensor<1x16x1xi32, #blocked6> 2026-02-21T12:32:03.0018971Z %183 = arith.muli %182, %cst_12 : tensor<1x16x1xi32, #blocked6> 2026-02-21T12:32:03.0019142Z %184 = arith.addi %71, %183 : tensor<1x16x1xi32, #blocked6> 2026-02-21T12:32:03.0019344Z %185 = tt.broadcast %184 : tensor<1x16x1xi32, #blocked6> -> tensor<1x16x128xi32, #blocked6> 2026-02-21T12:32:03.0019603Z %186 = ttg.convert_layout %185 : tensor<1x16x128xi32, #blocked6> -> tensor<1x16x128xi32, #blocked3> 2026-02-21T12:32:03.0019821Z %187 = arith.addi %186, %79 : tensor<1x16x128xi32, #blocked3> 2026-02-21T12:32:03.0020048Z %188 = tt.addptr %80, %187 : tensor<1x16x128x!tt.ptr, #blocked3>, tensor<1x16x128xi32, #blocked3> 2026-02-21T12:32:03.0020274Z %189 = tt.load %188 : tensor<1x16x128x!tt.ptr, #blocked3> 2026-02-21T12:32:03.0020481Z %190 = arith.truncf %164 : 
tensor<1x2x16xf32, #blocked7> to tensor<1x2x16xbf16, #blocked7> 2026-02-21T12:32:03.0020729Z %191 = tt.reshape %176 : tensor<1x2x128xf32, #blocked3> -> tensor<2x128xf32, #blocked13> 2026-02-21T12:32:03.0020965Z %192 = tt.reshape %190 : tensor<1x2x16xbf16, #blocked7> -> tensor<2x16xbf16, #blocked9> 2026-02-21T12:32:03.0021217Z %193 = tt.reshape %189 : tensor<1x16x128xbf16, #blocked3> -> tensor<16x128xbf16, #blocked13> 2026-02-21T12:32:03.0021527Z %194 = ttg.convert_layout %192 : tensor<2x16xbf16, #blocked9> -> tensor<2x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked13}>> 2026-02-21T12:32:03.0021889Z %195 = ttg.convert_layout %193 : tensor<16x128xbf16, #blocked13> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked13}>> 2026-02-21T12:32:03.0022205Z %196 = ttg.convert_layout %191 : tensor<2x128xf32, #blocked13> -> tensor<2x128xf32, #blocked13> 2026-02-21T12:32:03.0022623Z %197 = tt.dot %194, %195, %196, inputPrecision = tf32 : tensor<2x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked13}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked13}>> -> tensor<2x128xf32, #blocked13> 2026-02-21T12:32:03.0023028Z %198 = tt.reshape %197 : tensor<2x128xf32, #blocked13> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T12:32:03.0023322Z scf.yield %154, %170, %198 : tensor<1x2xf32, #blocked8>, tensor<1x2xf32, #blocked8>, tensor<1x2x128xf32, #blocked3> 2026-02-21T12:32:03.0023539Z } {tt.num_stages = 4 : i32} 2026-02-21T12:32:03.0023759Z %82 = ttg.convert_layout %81#1 : tensor<1x2xf32, #blocked8> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T12:32:03.0024099Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xf32, #blocked12> 2026-02-21T12:32:03.0029531Z %84 = ttg.convert_layout %83 : tensor<1x2x1xf32, #blocked12> -> tensor<1x2x1xf32, #blocked5> 2026-02-21T12:32:03.0029777Z %85 = tt.broadcast %84 : tensor<1x2x1xf32, #blocked5> -> tensor<1x2x128xf32, #blocked5> 
2026-02-21T12:32:03.0030022Z %86 = ttg.convert_layout %85 : tensor<1x2x128xf32, #blocked5> -> tensor<1x2x128xf32, #blocked3> 2026-02-21T12:32:03.0030241Z %87 = arith.divf %81#2, %86 : tensor<1x2x128xf32, #blocked3> 2026-02-21T12:32:03.0030446Z %88 = arith.truncf %87 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3> 2026-02-21T12:32:03.0030666Z %89 = arith.muli %0, %c32768_i32 : i32 2026-02-21T12:32:03.0030888Z %90 = ttg.convert_layout %5 : tensor<2xi32, #blocked10> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> 2026-02-21T12:32:03.0031208Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x2xi32, #blocked11> 2026-02-21T12:32:03.0031498Z %92 = ttg.convert_layout %91 : tensor<1x2xi32, #blocked11> -> tensor<1x2xi32, #blocked8> 2026-02-21T12:32:03.0031780Z %93 = ttg.convert_layout %92 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T12:32:03.0032130Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xi32, #blocked12> 2026-02-21T12:32:03.0032430Z %95 = ttg.convert_layout %94 : tensor<1x2x1xi32, #blocked12> -> tensor<1x2x1xi32, #blocked5> 2026-02-21T12:32:03.0032639Z %96 = arith.muli %95, %cst_11 : tensor<1x2x1xi32, #blocked5> 2026-02-21T12:32:03.0032805Z %97 = tt.splat %89 : i32 -> tensor<1x2x1xi32, #blocked5> 2026-02-21T12:32:03.0032959Z %98 = arith.addi %97, %96 : tensor<1x2x1xi32, #blocked5> 2026-02-21T12:32:03.0033202Z %99 = ttg.convert_layout %6 : tensor<128xi32, #blocked10> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> 2026-02-21T12:32:03.0033557Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x128xi32, #blocked11> 2026-02-21T12:32:03.0033862Z %101 = ttg.convert_layout %100 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #blocked13> 2026-02-21T12:32:03.0034171Z 
%102 = ttg.convert_layout %101 : tensor<1x128xi32, #blocked13> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> 2026-02-21T12:32:03.0034528Z %103 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x128xi32, #blocked14> 2026-02-21T12:32:03.0034844Z %104 = ttg.convert_layout %103 : tensor<1x1x128xi32, #blocked14> -> tensor<1x1x128xi32, #blocked4> 2026-02-21T12:32:03.0035097Z %105 = tt.broadcast %98 : tensor<1x2x1xi32, #blocked5> -> tensor<1x2x128xi32, #blocked5> 2026-02-21T12:32:03.0035344Z %106 = ttg.convert_layout %105 : tensor<1x2x128xi32, #blocked5> -> tensor<1x2x128xi32, #blocked3> 2026-02-21T12:32:03.0035601Z %107 = tt.broadcast %104 : tensor<1x1x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4> 2026-02-21T12:32:03.0035855Z %108 = ttg.convert_layout %107 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked3> 2026-02-21T12:32:03.0036072Z %109 = arith.addi %106, %108 : tensor<1x2x128xi32, #blocked3> 2026-02-21T12:32:03.0036268Z %110 = tt.splat %arg3 : !tt.ptr -> tensor<1x2x128x!tt.ptr, #blocked3> 2026-02-21T12:32:03.0036523Z %111 = tt.addptr %110, %109 : tensor<1x2x128x!tt.ptr, #blocked3>, tensor<1x2x128xi32, #blocked3> 2026-02-21T12:32:03.0036745Z tt.store %111, %88 : tensor<1x2x128x!tt.ptr, #blocked3> 2026-02-21T12:32:03.0036883Z tt.return 2026-02-21T12:32:03.0036969Z } 2026-02-21T12:32:03.0037053Z } 2026-02-21T12:32:03.0037099Z 2026-02-21T12:32:03.0037131Z {-# 2026-02-21T12:32:03.0037218Z external_resources: { 2026-02-21T12:32:03.0037321Z mlir_reproducer: { 2026-02-21T12:32:03.0039564Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T12:32:03.0041896Z disable_threading: false, 2026-02-21T12:32:03.0042009Z verify_each: true 2026-02-21T12:32:03.0042101Z } 2026-02-21T12:32:03.0042184Z } 2026-02-21T12:32:03.0042254Z #-} 2026-02-21T12:32:03.0042540Z /tmp/torchinductor_root/4e/c4eorrzxwkgp4flh757wxxie4yjblurji4dyrobpe2unl45lzbz2.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T12:32:03.0043297Z /tmp/torchinductor_root/4e/c4eorrzxwkgp4flh757wxxie4yjblurji4dyrobpe2unl45lzbz2.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T12:32:03.0043857Z [25s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T12:32:03.0044595Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T12:32:03.0045264Z Error: RuntimeError: PassManager::run failed 2026-02-21T12:32:03.0045438Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T12:32:06.7100839Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:66:133: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T12:32:06.7102139Z k = tl.load(k_view + (indices_0[:, None, None] * 32768 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T12:32:06.7102887Z ^ 2026-02-21T12:32:06.7104815Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:68:145: note: - use: %118 = "tt.reshape"(<>) : (tensor<1x128x64xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x64xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>> 2026-02-21T12:32:06.7107361Z 2026-02-21T12:32:06.7108303Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T12:32:06.7109728Z ^ 2026-02-21T12:32:06.7110239Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T12:32:06.7110983Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA 
= [1, 2, 2], order = [2, 1, 0]}> 2026-02-21T12:32:06.7111789Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.7112459Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}> 2026-02-21T12:32:06.7113020Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.7113577Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T12:32:06.7114126Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}> 2026-02-21T12:32:06.7114659Z #blocked6 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}> 2026-02-21T12:32:06.7115310Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}> 2026-02-21T12:32:06.7115837Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}> 2026-02-21T12:32:06.7116400Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}> 2026-02-21T12:32:06.7116980Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.7117555Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.7118129Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.7118731Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 
2026-02-21T12:32:06.7119308Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.7119874Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.7120451Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.7121071Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T12:32:06.7121907Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T12:32:06.7122445Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T12:32:06.7122695Z %c256_i64 = arith.constant 256 : i64 2026-02-21T12:32:06.7122861Z %c192_i64 = arith.constant 192 : i64 2026-02-21T12:32:06.7123059Z %c0_i64 = arith.constant 0 : i64 2026-02-21T12:32:06.7123211Z %c128_i64 = arith.constant 128 : i64 2026-02-21T12:32:06.7123373Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T12:32:06.7123598Z %cst = arith.constant dense<0.000000e+00> : tensor<1x64x128xbf16, #blocked> 2026-02-21T12:32:06.7123868Z %cst_0 = arith.constant dense<256> : tensor<1x64x1xi64, #blocked1> 2026-02-21T12:32:06.7124108Z %cst_1 = arith.constant dense<0> : tensor<1x64x1xi64, #blocked1> 2026-02-21T12:32:06.7124376Z %cst_2 = arith.constant dense<128> : tensor<1x64x1xi64, #blocked1> 2026-02-21T12:32:06.7124637Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x1x128xbf16, #blocked2> 2026-02-21T12:32:06.7124898Z %cst_4 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7125143Z %cst_5 = arith.constant dense<0> : tensor<1x1x128xi64, 
#blocked2> 2026-02-21T12:32:06.7125344Z %c12288_i32 = arith.constant 12288 : i32 2026-02-21T12:32:06.7125509Z %c1_i32 = arith.constant 1 : i32 2026-02-21T12:32:06.7125718Z %cst_6 = arith.constant dense<0.127517432> : tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7126007Z %cst_7 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7126269Z %cst_8 = arith.constant dense<0.000000e+00> : tensor<1x64xf32, #blocked5> 2026-02-21T12:32:06.7126519Z %cst_9 = arith.constant dense<128> : tensor<1x1x64xi32, #blocked3> 2026-02-21T12:32:06.7126721Z %c0_i32 = arith.constant 0 : i32 2026-02-21T12:32:06.7126929Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked2> 2026-02-21T12:32:06.7127200Z %cst_11 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7127460Z %cst_12 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7127700Z %c64_i32 = arith.constant 64 : i32 2026-02-21T12:32:06.7127858Z %c256_i32 = arith.constant 256 : i32 2026-02-21T12:32:06.7128016Z %c49152_i32 = arith.constant 49152 : i32 2026-02-21T12:32:06.7128181Z %c162_i32 = arith.constant 162 : i32 2026-02-21T12:32:06.7128339Z %0 = tt.get_program_id x : i32 2026-02-21T12:32:06.7128495Z %1 = arith.muli %0, %c162_i32 : i32 2026-02-21T12:32:06.7128646Z %2 = arith.addi %1, %c162_i32 : i32 2026-02-21T12:32:06.7128803Z %3 = arith.minsi %2, %c49152_i32 : i32 2026-02-21T12:32:06.7129023Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked6> 2026-02-21T12:32:06.7129307Z %5 = tt.splat %arg0 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked2> 2026-02-21T12:32:06.7129588Z %6 = arith.extsi %4 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6> 2026-02-21T12:32:06.7129944Z %7 = ttg.convert_layout %6 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:32:06.7130389Z %8 = tt.expand_dims %7 {axis = 0 : i32} : tensor<128xi64, 
#ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7> 2026-02-21T12:32:06.7130775Z %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8> 2026-02-21T12:32:06.7131160Z %10 = ttg.convert_layout %9 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T12:32:06.7131619Z %11 = tt.expand_dims %10 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9> 2026-02-21T12:32:06.7132026Z %12 = ttg.convert_layout %11 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7132323Z %13 = arith.cmpi sge, %12, %cst_5 : tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7132558Z %14 = arith.cmpi slt, %12, %cst_4 : tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7132754Z %15 = arith.andi %13, %14 : tensor<1x1x128xi1, #blocked2> 2026-02-21T12:32:06.7132955Z %16 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked6> 2026-02-21T12:32:06.7133254Z %17 = ttg.convert_layout %4 : tensor<128xi32, #blocked6> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:32:06.7133602Z %18 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T12:32:06.7133913Z %19 = ttg.convert_layout %18 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked8> 2026-02-21T12:32:06.7134222Z %20 = ttg.convert_layout %19 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T12:32:06.7134608Z %21 = tt.expand_dims %20 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi32, #blocked10> 2026-02-21T12:32:06.7134940Z %22 = ttg.convert_layout %21 : tensor<1x128x1xi32, #blocked10> -> tensor<1x128x1xi32, #blocked11> 2026-02-21T12:32:06.7135200Z %23 = tt.splat %arg1 : !tt.ptr -> 
tensor<1x128x64x!tt.ptr, #blocked12> 2026-02-21T12:32:06.7135429Z %24 = tt.splat %arg2 : !tt.ptr -> tensor<1x64x128x!tt.ptr, #blocked> 2026-02-21T12:32:06.7135651Z %25 = arith.extsi %16 : tensor<64xi32, #blocked6> to tensor<64xi64, #blocked6> 2026-02-21T12:32:06.7135912Z %26 = tt.broadcast %12 : tensor<1x1x128xi64, #blocked2> -> tensor<1x64x128xi64, #blocked2> 2026-02-21T12:32:06.7136174Z %27 = ttg.convert_layout %26 : tensor<1x64x128xi64, #blocked2> -> tensor<1x64x128xi64, #blocked> 2026-02-21T12:32:06.7136443Z %28 = tt.broadcast %15 : tensor<1x1x128xi1, #blocked2> -> tensor<1x64x128xi1, #blocked2> 2026-02-21T12:32:06.7136702Z %29 = ttg.convert_layout %28 : tensor<1x64x128xi1, #blocked2> -> tensor<1x64x128xi1, #blocked> 2026-02-21T12:32:06.7136945Z %30 = tt.splat %arg3 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked2> 2026-02-21T12:32:06.7137137Z scf.for %arg4 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T12:32:06.7137306Z %31 = arith.divsi %arg4, %c12288_i32 : i32 2026-02-21T12:32:06.7137448Z %32 = arith.muli %31, %c64_i32 : i32 2026-02-21T12:32:06.7137579Z %33 = arith.subi %c256_i32, %32 : i32 2026-02-21T12:32:06.7137709Z %34 = arith.minsi %33, %c64_i32 : i32 2026-02-21T12:32:06.7137842Z %35 = arith.remsi %arg4, %c12288_i32 : i32 2026-02-21T12:32:06.7137976Z %36 = arith.remsi %35, %34 : i32 2026-02-21T12:32:06.7138103Z %37 = arith.addi %32, %36 : i32 2026-02-21T12:32:06.7138222Z %38 = arith.divsi %35, %34 : i32 2026-02-21T12:32:06.7138347Z %39 = arith.extsi %38 : i32 to i64 2026-02-21T12:32:06.7138472Z %40 = arith.extsi %37 : i32 to i64 2026-02-21T12:32:06.7138604Z %41 = arith.muli %39, %c32768_i64 : i64 2026-02-21T12:32:06.7138756Z %42 = tt.splat %41 : i64 -> tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7138909Z %43 = arith.muli %40, %c128_i64 : i64 2026-02-21T12:32:06.7139062Z %44 = tt.splat %43 : i64 -> tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7139228Z %45 = arith.addi %44, %12 : tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7139400Z %46 = 
arith.addi %42, %45 : tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7139618Z %47 = tt.addptr %5, %46 : tensor<1x1x128x!tt.ptr, #blocked2>, tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7139829Z %48 = arith.cmpi sge, %39, %c0_i64 : i64 2026-02-21T12:32:06.7139962Z %49 = arith.cmpi slt, %39, %c192_i64 : i64 2026-02-21T12:32:06.7140094Z %50 = arith.andi %48, %49 : i1 2026-02-21T12:32:06.7140224Z %51 = arith.cmpi sge, %40, %c0_i64 : i64 2026-02-21T12:32:06.7140355Z %52 = arith.cmpi slt, %40, %c256_i64 : i64 2026-02-21T12:32:06.7140486Z %53 = arith.andi %51, %52 : i1 2026-02-21T12:32:06.7140602Z %54 = arith.andi %50, %53 : i1 2026-02-21T12:32:06.7140748Z %55 = tt.splat %54 : i1 -> tensor<1x1x128xi1, #blocked2> 2026-02-21T12:32:06.7140917Z %56 = arith.andi %55, %15 : tensor<1x1x128xi1, #blocked2> 2026-02-21T12:32:06.7141101Z %57 = tt.load %47, %56, %cst_3 : tensor<1x1x128x!tt.ptr, #blocked2> 2026-02-21T12:32:06.7141288Z %58 = arith.muli %38, %c32768_i32 : i32 2026-02-21T12:32:06.7141441Z %59 = tt.splat %58 : i32 -> tensor<1x128x1xi32, #blocked11> 2026-02-21T12:32:06.7141618Z %60 = arith.addi %59, %22 : tensor<1x128x1xi32, #blocked11> 2026-02-21T12:32:06.7141842Z %61 = tt.broadcast %60 : tensor<1x128x1xi32, #blocked11> -> tensor<1x128x64xi32, #blocked11> 2026-02-21T12:32:06.7142124Z %62 = ttg.convert_layout %61 : tensor<1x128x64xi32, #blocked11> -> tensor<1x128x64xi32, #blocked12> 2026-02-21T12:32:06.7142393Z %63 = tt.reshape %57 : tensor<1x1x128xbf16, #blocked2> -> tensor<1x128xbf16, #blocked8> 2026-02-21T12:32:06.7142597Z %64 = tt.splat %41 : i64 -> tensor<1x64x128xi64, #blocked> 2026-02-21T12:32:06.7142760Z %65 = tt.splat %50 : i1 -> tensor<1x64x1xi1, #blocked1> 2026-02-21T12:32:06.7143126Z %66:3 = scf.for %arg5 = %c0_i32 to %c256_i32 step %c64_i32 iter_args(%arg6 = %cst_12, %arg7 = %cst_11, %arg8 = %cst_10) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked2>) : i32 { 2026-02-21T12:32:06.7143492Z %75 = tt.splat %arg5 : 
i32 -> tensor<64xi32, #blocked6> 2026-02-21T12:32:06.7143696Z %76 = arith.addi %75, %16 : tensor<64xi32, #blocked6> 2026-02-21T12:32:06.7143938Z %77 = ttg.convert_layout %76 : tensor<64xi32, #blocked6> -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:32:06.7144279Z %78 = tt.expand_dims %77 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x64xi32, #blocked7> 2026-02-21T12:32:06.7144575Z %79 = ttg.convert_layout %78 : tensor<1x64xi32, #blocked7> -> tensor<1x64xi32, #blocked5> 2026-02-21T12:32:06.7144872Z %80 = ttg.convert_layout %79 : tensor<1x64xi32, #blocked5> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked13}>> 2026-02-21T12:32:06.7145238Z %81 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked13}>> -> tensor<1x1x64xi32, #blocked13> 2026-02-21T12:32:06.7145551Z %82 = ttg.convert_layout %81 : tensor<1x1x64xi32, #blocked13> -> tensor<1x1x64xi32, #blocked3> 2026-02-21T12:32:06.7145773Z %83 = arith.muli %82, %cst_9 : tensor<1x1x64xi32, #blocked3> 2026-02-21T12:32:06.7145979Z %84 = tt.broadcast %83 : tensor<1x1x64xi32, #blocked3> -> tensor<1x128x64xi32, #blocked3> 2026-02-21T12:32:06.7146239Z %85 = ttg.convert_layout %84 : tensor<1x128x64xi32, #blocked3> -> tensor<1x128x64xi32, #blocked12> 2026-02-21T12:32:06.7146464Z %86 = arith.addi %62, %85 : tensor<1x128x64xi32, #blocked12> 2026-02-21T12:32:06.7146690Z %87 = tt.addptr %23, %86 : tensor<1x128x64x!tt.ptr, #blocked12>, tensor<1x128x64xi32, #blocked12> 2026-02-21T12:32:06.7146919Z %88 = tt.load %87 : tensor<1x128x64x!tt.ptr, #blocked12> 2026-02-21T12:32:06.7147128Z %89 = tt.reshape %88 : tensor<1x128x64xbf16, #blocked12> -> tensor<128x64xbf16, #blocked5> 2026-02-21T12:32:06.7147437Z %90 = ttg.convert_layout %63 : tensor<1x128xbf16, #blocked8> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> 2026-02-21T12:32:06.7147799Z %91 = ttg.convert_layout %89 : tensor<128x64xbf16, #blocked5> -> 
tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> 2026-02-21T12:32:06.7148109Z %92 = ttg.convert_layout %cst_8 : tensor<1x64xf32, #blocked5> -> tensor<1x64xf32, #blocked5> 2026-02-21T12:32:06.7148525Z %93 = tt.dot %90, %91, %92, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x64xf32, #blocked5> 2026-02-21T12:32:06.7148920Z %94 = tt.reshape %93 : tensor<1x64xf32, #blocked5> -> tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7149163Z %95 = arith.truncf %94 : tensor<1x1x64xf32, #blocked3> to tensor<1x1x64xbf16, #blocked3> 2026-02-21T12:32:06.7149404Z %96 = arith.extf %95 : tensor<1x1x64xbf16, #blocked3> to tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7149610Z %97 = "tt.reduce"(%96) <{axis = 2 : i32}> ({ 2026-02-21T12:32:06.7149745Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:32:06.7149871Z %162 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:32:06.7150007Z tt.reduce.return %162 : f32 2026-02-21T12:32:06.7150200Z }) : (tensor<1x1x64xf32, #blocked3>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> 2026-02-21T12:32:06.7150499Z %98 = ttg.convert_layout %97 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7150793Z %99 = arith.truncf %98 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T12:32:06.7151017Z %100 = arith.extf %99 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7151218Z %101 = arith.mulf %100, %cst_7 : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7151414Z %102 = arith.truncf %101 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T12:32:06.7151643Z %103 = arith.extf %102 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7151864Z %104 = arith.cmpf ogt, %arg6, %103 : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7152041Z %105 = arith.cmpf 
une, %arg6, %arg6 : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7152217Z %106 = arith.ori %104, %105 : tensor<1x1xi1, #blocked4> 2026-02-21T12:32:06.7152419Z %107 = arith.select %106, %arg6, %103 : tensor<1x1xi1, #blocked4>, tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7152636Z %108 = arith.mulf %96, %cst_6 : tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7152843Z %109 = arith.truncf %108 : tensor<1x1x64xf32, #blocked3> to tensor<1x1x64xbf16, #blocked3> 2026-02-21T12:32:06.7153174Z %110 = ttg.convert_layout %107 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T12:32:06.7153524Z %111 = tt.expand_dims %110 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T12:32:06.7153833Z %112 = ttg.convert_layout %111 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T12:32:06.7154092Z %113 = arith.extf %109 : tensor<1x1x64xbf16, #blocked3> to tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7154336Z %114 = tt.broadcast %112 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x64xf32, #blocked15> 2026-02-21T12:32:06.7154591Z %115 = ttg.convert_layout %114 : tensor<1x1x64xf32, #blocked15> -> tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7154811Z %116 = arith.subf %113, %115 : tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7155119Z %117 = tt.extern_elementwise %116 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x64xf32, #blocked3>) -> tensor<1x1x64xf32, #blocked3> 2026-02-21T12:32:06.7155416Z %118 = "tt.reduce"(%117) <{axis = 2 : i32}> ({ 2026-02-21T12:32:06.7155553Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:32:06.7155676Z %162 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:32:06.7155808Z tt.reduce.return %162 : f32 2026-02-21T12:32:06.7155996Z }) : (tensor<1x1x64xf32, #blocked3>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> 2026-02-21T12:32:06.7156296Z %119 = 
ttg.convert_layout %118 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7156542Z %120 = arith.subf %arg6, %107 : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7156838Z %121 = tt.extern_elementwise %120 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7157140Z %122 = arith.mulf %arg7, %121 : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7157303Z %123 = arith.addf %122, %119 : tensor<1x1xf32, #blocked4> 2026-02-21T12:32:06.7157574Z %124 = ttg.convert_layout %121 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T12:32:06.7157919Z %125 = tt.expand_dims %124 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T12:32:06.7158229Z %126 = ttg.convert_layout %125 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T12:32:06.7158502Z %127 = tt.broadcast %126 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15> 2026-02-21T12:32:06.7158788Z %128 = ttg.convert_layout %127 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked2> 2026-02-21T12:32:06.7159007Z %129 = arith.mulf %arg8, %128 : tensor<1x1x128xf32, #blocked2> 2026-02-21T12:32:06.7159163Z %130 = arith.extsi %arg5 : i32 to i64 2026-02-21T12:32:06.7159309Z %131 = tt.splat %130 : i64 -> tensor<64xi64, #blocked6> 2026-02-21T12:32:06.7159472Z %132 = arith.addi %131, %25 : tensor<64xi64, #blocked6> 2026-02-21T12:32:06.7159714Z %133 = ttg.convert_layout %132 : tensor<64xi64, #blocked6> -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:32:06.7160070Z %134 = tt.expand_dims %133 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x64xi64, #blocked7> 2026-02-21T12:32:06.7160369Z %135 = ttg.convert_layout %134 : tensor<1x64xi64, #blocked7> 
-> tensor<1x64xi64, #blocked5> 2026-02-21T12:32:06.7160666Z %136 = ttg.convert_layout %135 : tensor<1x64xi64, #blocked5> -> tensor<1x64xi64, #ttg.slice<{dim = 2, parent = #blocked16}>> 2026-02-21T12:32:06.7161020Z %137 = tt.expand_dims %136 {axis = 2 : i32} : tensor<1x64xi64, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x64x1xi64, #blocked16> 2026-02-21T12:32:06.7161354Z %138 = ttg.convert_layout %137 : tensor<1x64x1xi64, #blocked16> -> tensor<1x64x1xi64, #blocked1> 2026-02-21T12:32:06.7161573Z %139 = arith.muli %138, %cst_2 : tensor<1x64x1xi64, #blocked1> 2026-02-21T12:32:06.7161784Z %140 = tt.broadcast %139 : tensor<1x64x1xi64, #blocked1> -> tensor<1x64x128xi64, #blocked1> 2026-02-21T12:32:06.7162045Z %141 = ttg.convert_layout %140 : tensor<1x64x128xi64, #blocked1> -> tensor<1x64x128xi64, #blocked> 2026-02-21T12:32:06.7162269Z %142 = arith.addi %141, %27 : tensor<1x64x128xi64, #blocked> 2026-02-21T12:32:06.7162437Z %143 = arith.addi %64, %142 : tensor<1x64x128xi64, #blocked> 2026-02-21T12:32:06.7162713Z %144 = tt.addptr %24, %143 : tensor<1x64x128x!tt.ptr, #blocked>, tensor<1x64x128xi64, #blocked> 2026-02-21T12:32:06.7162946Z %145 = arith.cmpi sge, %138, %cst_1 : tensor<1x64x1xi64, #blocked1> 2026-02-21T12:32:06.7163130Z %146 = arith.cmpi slt, %138, %cst_0 : tensor<1x64x1xi64, #blocked1> 2026-02-21T12:32:06.7163316Z %147 = arith.andi %145, %146 : tensor<1x64x1xi1, #blocked1> 2026-02-21T12:32:06.7163483Z %148 = arith.andi %65, %147 : tensor<1x64x1xi1, #blocked1> 2026-02-21T12:32:06.7163690Z %149 = tt.broadcast %148 : tensor<1x64x1xi1, #blocked1> -> tensor<1x64x128xi1, #blocked1> 2026-02-21T12:32:06.7163950Z %150 = ttg.convert_layout %149 : tensor<1x64x128xi1, #blocked1> -> tensor<1x64x128xi1, #blocked> 2026-02-21T12:32:06.7164164Z %151 = arith.andi %150, %29 : tensor<1x64x128xi1, #blocked> 2026-02-21T12:32:06.7164350Z %152 = tt.load %144, %151, %cst : tensor<1x64x128x!tt.ptr, #blocked> 2026-02-21T12:32:06.7164575Z %153 = arith.truncf %117 : 
tensor<1x1x64xf32, #blocked3> to tensor<1x1x64xbf16, #blocked3> 2026-02-21T12:32:06.7164821Z %154 = tt.reshape %129 : tensor<1x1x128xf32, #blocked2> -> tensor<1x128xf32, #blocked8> 2026-02-21T12:32:06.7165061Z %155 = tt.reshape %153 : tensor<1x1x64xbf16, #blocked3> -> tensor<1x64xbf16, #blocked5> 2026-02-21T12:32:06.7165301Z %156 = tt.reshape %152 : tensor<1x64x128xbf16, #blocked> -> tensor<64x128xbf16, #blocked8> 2026-02-21T12:32:06.7168895Z %157 = ttg.convert_layout %155 : tensor<1x64xbf16, #blocked5> -> tensor<1x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T12:32:06.7169253Z %158 = ttg.convert_layout %156 : tensor<64x128xbf16, #blocked8> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T12:32:06.7169560Z %159 = ttg.convert_layout %154 : tensor<1x128xf32, #blocked8> -> tensor<1x128xf32, #blocked8> 2026-02-21T12:32:06.7170004Z %160 = tt.dot %157, %158, %159, inputPrecision = tf32 : tensor<1x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x128xf32, #blocked8> 2026-02-21T12:32:06.7170401Z %161 = tt.reshape %160 : tensor<1x128xf32, #blocked8> -> tensor<1x1x128xf32, #blocked2> 2026-02-21T12:32:06.7170686Z scf.yield %107, %123, %161 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked2> 2026-02-21T12:32:06.7170903Z } {tt.flatten} 2026-02-21T12:32:06.7171114Z %67 = ttg.convert_layout %66#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> 2026-02-21T12:32:06.7171479Z %68 = tt.expand_dims %67 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T12:32:06.7171777Z %69 = ttg.convert_layout %68 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15> 2026-02-21T12:32:06.7172031Z %70 = tt.broadcast %69 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15> 
2026-02-21T12:32:06.7172279Z %71 = ttg.convert_layout %70 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked2> 2026-02-21T12:32:06.7172501Z %72 = arith.divf %66#2, %71 : tensor<1x1x128xf32, #blocked2> 2026-02-21T12:32:06.7172733Z %73 = arith.truncf %72 : tensor<1x1x128xf32, #blocked2> to tensor<1x1x128xbf16, #blocked2> 2026-02-21T12:32:06.7172984Z %74 = tt.addptr %30, %46 : tensor<1x1x128x!tt.ptr, #blocked2>, tensor<1x1x128xi64, #blocked2> 2026-02-21T12:32:06.7173208Z tt.store %74, %73, %56 : tensor<1x1x128x!tt.ptr, #blocked2> 2026-02-21T12:32:06.7173352Z } 2026-02-21T12:32:06.7173439Z tt.return 2026-02-21T12:32:06.7173523Z } 2026-02-21T12:32:06.7173609Z } 2026-02-21T12:32:06.7173654Z 2026-02-21T12:32:06.7173690Z {-# 2026-02-21T12:32:06.7173776Z external_resources: { 2026-02-21T12:32:06.7173883Z mlir_reproducer: { 2026-02-21T12:32:06.7176127Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, 
tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T12:32:06.7178432Z disable_threading: false, 2026-02-21T12:32:06.7178548Z verify_each: true 2026-02-21T12:32:06.7178642Z } 2026-02-21T12:32:06.7178739Z } 2026-02-21T12:32:06.7178812Z #-} 2026-02-21T12:32:06.7179095Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T12:32:06.7179820Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:19:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T12:32:06.7180404Z [29s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T12:32:06.7181203Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T12:32:06.7181948Z Error: RuntimeError: PassManager::run failed 2026-02-21T12:32:06.7182119Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T12:32:06.8255250Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:56:129: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T12:32:06.8256438Z k = tl.load(k_view + (indices_0[:, None, None] * 32768 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T12:32:06.8257163Z ^ 2026-02-21T12:32:06.8259156Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:58:141: note: - use: %140 = "tt.reshape"(<>) : (tensor<1x128x32xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x32xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>> 2026-02-21T12:32:06.8260762Z 2026-02-21T12:32:06.8261703Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T12:32:06.8262947Z ^ 2026-02-21T12:32:06.8263388Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T12:32:06.8263700Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [8, 8, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.8264037Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.8264358Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.8264679Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.8264998Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [1, 1], order = [1, 0]}> 
2026-02-21T12:32:06.8265300Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T12:32:06.8265610Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.8265909Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T12:32:06.8266238Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T12:32:06.8266544Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [8, 8, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.8266861Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T12:32:06.8267175Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.8267517Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.8267843Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:32:06.8268172Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.8268489Z #blocked15 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T12:32:06.8268827Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:32:06.8269137Z #blocked17 = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T12:32:06.8269472Z module attributes 
{"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T12:32:06.8270029Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T12:32:06.8270438Z %c192_i64 = arith.constant 192 : i64 2026-02-21T12:32:06.8270565Z %c0_i64 = arith.constant 0 : i64 2026-02-21T12:32:06.8270690Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T12:32:06.8270814Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T12:32:06.8270971Z %cst = arith.constant dense<256> : tensor<1x8x1xi64, #blocked> 2026-02-21T12:32:06.8271150Z %cst_0 = arith.constant dense<0> : tensor<1x8x1xi64, #blocked> 2026-02-21T12:32:06.8271338Z %cst_1 = arith.constant dense<128> : tensor<1x8x1xi64, #blocked> 2026-02-21T12:32:06.8271535Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<1x32x128xbf16, #blocked1> 2026-02-21T12:32:06.8271740Z %cst_3 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8271921Z %cst_4 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8272102Z %cst_5 = arith.constant dense<256> : tensor<1x32x1xi64, #blocked2> 2026-02-21T12:32:06.8272281Z %cst_6 = arith.constant dense<0> : tensor<1x32x1xi64, #blocked2> 2026-02-21T12:32:06.8272456Z %cst_7 = arith.constant dense<128> : tensor<1x32x1xi64, #blocked2> 2026-02-21T12:32:06.8272607Z %c32_i32 = arith.constant 32 : i32 2026-02-21T12:32:06.8272720Z %c256_i32 = arith.constant 256 : i32 2026-02-21T12:32:06.8272841Z %c768_i32 = arith.constant 768 : i32 2026-02-21T12:32:06.8273001Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8273197Z %cst_9 = arith.constant dense<0.127517432> : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8273397Z %cst_10 = arith.constant dense<0.000000e+00> 
: tensor<8x32xf32, #blocked5> 2026-02-21T12:32:06.8273588Z %cst_11 = arith.constant dense<128> : tensor<1x1x32xi32, #blocked6> 2026-02-21T12:32:06.8273739Z %c0_i32 = arith.constant 0 : i32 2026-02-21T12:32:06.8273881Z %cst_12 = arith.constant dense<128> : tensor<1x8x1xi32, #blocked> 2026-02-21T12:32:06.8274074Z %cst_13 = arith.constant dense<0.000000e+00> : tensor<1x8x128xf32, #blocked1> 2026-02-21T12:32:06.8274294Z %cst_14 = arith.constant dense<1.000000e+00> : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8274493Z %cst_15 = arith.constant dense<0xFF800000> : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8274663Z %c8_i32 = arith.constant 8 : i32 2026-02-21T12:32:06.8274782Z %c4_i32 = arith.constant 4 : i32 2026-02-21T12:32:06.8274898Z %0 = tt.get_program_id x : i32 2026-02-21T12:32:06.8275009Z %1 = arith.divsi %0, %c768_i32 : i32 2026-02-21T12:32:06.8275125Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T12:32:06.8275251Z %3 = arith.subi %c32_i32, %2 : i32 2026-02-21T12:32:06.8275364Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T12:32:06.8275476Z %5 = arith.remsi %0, %c768_i32 : i32 2026-02-21T12:32:06.8275587Z %6 = arith.remsi %5, %4 : i32 2026-02-21T12:32:06.8275698Z %7 = arith.addi %2, %6 : i32 2026-02-21T12:32:06.8275803Z %8 = arith.divsi %5, %4 : i32 2026-02-21T12:32:06.8275913Z %9 = arith.muli %7, %c8_i32 : i32 2026-02-21T12:32:06.8276067Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked7> 2026-02-21T12:32:06.8276250Z %11 = tt.splat %9 : i32 -> tensor<8xi32, #blocked7> 2026-02-21T12:32:06.8276416Z %12 = arith.addi %11, %10 : tensor<8xi32, #blocked7> 2026-02-21T12:32:06.8276590Z %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7> 2026-02-21T12:32:06.8276763Z %14 = arith.muli %8, %c32768_i32 : i32 2026-02-21T12:32:06.8276981Z %15 = ttg.convert_layout %12 : tensor<8xi32, #blocked7> -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T12:32:06.8277305Z %16 = tt.expand_dims %15 {axis 
= 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x8xi32, #blocked8> 2026-02-21T12:32:06.8277585Z %17 = ttg.convert_layout %16 : tensor<1x8xi32, #blocked8> -> tensor<1x8xi32, #blocked4> 2026-02-21T12:32:06.8277882Z %18 = ttg.convert_layout %17 : tensor<1x8xi32, #blocked4> -> tensor<1x8xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T12:32:06.8278211Z %19 = tt.expand_dims %18 {axis = 2 : i32} : tensor<1x8xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xi32, #blocked9> 2026-02-21T12:32:06.8278504Z %20 = ttg.convert_layout %19 : tensor<1x8x1xi32, #blocked9> -> tensor<1x8x1xi32, #blocked> 2026-02-21T12:32:06.8278711Z %21 = arith.muli %20, %cst_12 : tensor<1x8x1xi32, #blocked> 2026-02-21T12:32:06.8278869Z %22 = tt.splat %14 : i32 -> tensor<1x8x1xi32, #blocked> 2026-02-21T12:32:06.8279023Z %23 = arith.addi %22, %21 : tensor<1x8x1xi32, #blocked> 2026-02-21T12:32:06.8279255Z %24 = ttg.convert_layout %13 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T12:32:06.8279577Z %25 = tt.expand_dims %24 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8> 2026-02-21T12:32:06.8279872Z %26 = ttg.convert_layout %25 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked10> 2026-02-21T12:32:06.8280165Z %27 = ttg.convert_layout %26 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T12:32:06.8280511Z %28 = tt.expand_dims %27 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11> 2026-02-21T12:32:06.8280816Z %29 = ttg.convert_layout %28 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked1> 2026-02-21T12:32:06.8281059Z %30 = tt.broadcast %23 : tensor<1x8x1xi32, #blocked> -> tensor<1x8x128xi32, #blocked> 2026-02-21T12:32:06.8281300Z %31 = ttg.convert_layout %30 : tensor<1x8x128xi32, #blocked> 
-> tensor<1x8x128xi32, #blocked1> 2026-02-21T12:32:06.8281542Z %32 = tt.broadcast %29 : tensor<1x1x128xi32, #blocked1> -> tensor<1x8x128xi32, #blocked1> 2026-02-21T12:32:06.8281739Z %33 = arith.addi %31, %32 : tensor<1x8x128xi32, #blocked1> 2026-02-21T12:32:06.8281924Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<1x8x128x!tt.ptr, #blocked1> 2026-02-21T12:32:06.8282176Z %35 = tt.addptr %34, %33 : tensor<1x8x128x!tt.ptr, #blocked1>, tensor<1x8x128xi32, #blocked1> 2026-02-21T12:32:06.8282386Z %36 = tt.load %35 : tensor<1x8x128x!tt.ptr, #blocked1> 2026-02-21T12:32:06.8282624Z %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked7> 2026-02-21T12:32:06.8282985Z %38 = ttg.convert_layout %26 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> 2026-02-21T12:32:06.8283351Z %39 = tt.expand_dims %38 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi32, #blocked12> 2026-02-21T12:32:06.8283658Z %40 = ttg.convert_layout %39 : tensor<1x128x1xi32, #blocked12> -> tensor<1x128x1xi32, #blocked13> 2026-02-21T12:32:06.8283867Z %41 = tt.splat %14 : i32 -> tensor<1x128x1xi32, #blocked13> 2026-02-21T12:32:06.8284027Z %42 = arith.addi %41, %40 : tensor<1x128x1xi32, #blocked13> 2026-02-21T12:32:06.8284229Z %43 = tt.broadcast %42 : tensor<1x128x1xi32, #blocked13> -> tensor<1x128x32xi32, #blocked13> 2026-02-21T12:32:06.8284482Z %44 = ttg.convert_layout %43 : tensor<1x128x32xi32, #blocked13> -> tensor<1x128x32xi32, #blocked3> 2026-02-21T12:32:06.8284739Z %45 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x32x!tt.ptr, #blocked3> 2026-02-21T12:32:06.8284959Z %46 = tt.reshape %36 : tensor<1x8x128xbf16, #blocked1> -> tensor<8x128xbf16, #blocked10> 2026-02-21T12:32:06.8285139Z %47 = arith.extsi %8 : i32 to i64 2026-02-21T12:32:06.8285299Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<1x32x128x!tt.ptr, #blocked1> 2026-02-21T12:32:06.8285465Z %49 = arith.muli %47, %c32768_i64 : i64 
2026-02-21T12:32:06.8285604Z %50 = tt.splat %49 : i64 -> tensor<1x32x128xi64, #blocked1> 2026-02-21T12:32:06.8285802Z %51 = arith.extsi %37 : tensor<32xi32, #blocked7> to tensor<32xi64, #blocked7> 2026-02-21T12:32:06.8286011Z %52 = arith.extsi %13 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7> 2026-02-21T12:32:06.8286277Z %53 = ttg.convert_layout %52 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T12:32:06.8286608Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8> 2026-02-21T12:32:06.8286899Z %55 = ttg.convert_layout %54 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked10> 2026-02-21T12:32:06.8287187Z %56 = ttg.convert_layout %55 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T12:32:06.8287533Z %57 = tt.expand_dims %56 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11> 2026-02-21T12:32:06.8287838Z %58 = ttg.convert_layout %57 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8288086Z %59 = tt.broadcast %58 : tensor<1x1x128xi64, #blocked1> -> tensor<1x32x128xi64, #blocked1> 2026-02-21T12:32:06.8288273Z %60 = arith.cmpi sge, %47, %c0_i64 : i64 2026-02-21T12:32:06.8288396Z %61 = arith.cmpi slt, %47, %c192_i64 : i64 2026-02-21T12:32:06.8288518Z %62 = arith.andi %60, %61 : i1 2026-02-21T12:32:06.8288647Z %63 = tt.splat %62 : i1 -> tensor<1x32x1xi1, #blocked2> 2026-02-21T12:32:06.8288815Z %64 = arith.cmpi sge, %58, %cst_4 : tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8288990Z %65 = arith.cmpi slt, %58, %cst_3 : tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8289158Z %66 = arith.andi %64, %65 : tensor<1x1x128xi1, #blocked1> 2026-02-21T12:32:06.8289356Z %67 = tt.broadcast %66 : tensor<1x1x128xi1, #blocked1> -> tensor<1x32x128xi1, #blocked1> 
2026-02-21T12:32:06.8289760Z %68:3 = scf.for %arg4 = %c0_i32 to %c256_i32 step %c32_i32 iter_args(%arg5 = %cst_15, %arg6 = %cst_14, %arg7 = %cst_13) -> (tensor<1x8xf32, #blocked4>, tensor<1x8xf32, #blocked4>, tensor<1x8x128xf32, #blocked1>) : i32 { 2026-02-21T12:32:06.8290117Z %119 = tt.splat %arg4 : i32 -> tensor<32xi32, #blocked7> 2026-02-21T12:32:06.8290295Z %120 = arith.addi %119, %37 : tensor<32xi32, #blocked7> 2026-02-21T12:32:06.8290532Z %121 = ttg.convert_layout %120 : tensor<32xi32, #blocked7> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T12:32:06.8290863Z %122 = tt.expand_dims %121 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi32, #blocked8> 2026-02-21T12:32:06.8291175Z %123 = ttg.convert_layout %122 : tensor<1x32xi32, #blocked8> -> tensor<1x32xi32, #blocked5> 2026-02-21T12:32:06.8291469Z %124 = ttg.convert_layout %123 : tensor<1x32xi32, #blocked5> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> 2026-02-21T12:32:06.8291817Z %125 = tt.expand_dims %124 {axis = 1 : i32} : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x32xi32, #blocked14> 2026-02-21T12:32:06.8292125Z %126 = ttg.convert_layout %125 : tensor<1x1x32xi32, #blocked14> -> tensor<1x1x32xi32, #blocked6> 2026-02-21T12:32:06.8292343Z %127 = arith.muli %126, %cst_11 : tensor<1x1x32xi32, #blocked6> 2026-02-21T12:32:06.8292562Z %128 = tt.broadcast %127 : tensor<1x1x32xi32, #blocked6> -> tensor<1x128x32xi32, #blocked6> 2026-02-21T12:32:06.8292821Z %129 = ttg.convert_layout %128 : tensor<1x128x32xi32, #blocked6> -> tensor<1x128x32xi32, #blocked3> 2026-02-21T12:32:06.8293038Z %130 = arith.addi %44, %129 : tensor<1x128x32xi32, #blocked3> 2026-02-21T12:32:06.8293259Z %131 = tt.addptr %45, %130 : tensor<1x128x32x!tt.ptr, #blocked3>, tensor<1x128x32xi32, #blocked3> 2026-02-21T12:32:06.8293481Z %132 = tt.load %131 : tensor<1x128x32x!tt.ptr, #blocked3> 2026-02-21T12:32:06.8293686Z %133 = tt.reshape 
%132 : tensor<1x128x32xbf16, #blocked3> -> tensor<128x32xbf16, #blocked5> 2026-02-21T12:32:06.8294004Z %134 = ttg.convert_layout %46 : tensor<8x128xbf16, #blocked10> -> tensor<8x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked15}>> 2026-02-21T12:32:06.8294366Z %135 = ttg.convert_layout %133 : tensor<128x32xbf16, #blocked5> -> tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked15}>> 2026-02-21T12:32:06.8294673Z %136 = ttg.convert_layout %cst_10 : tensor<8x32xf32, #blocked5> -> tensor<8x32xf32, #blocked15> 2026-02-21T12:32:06.8295092Z %137 = tt.dot %134, %135, %136, inputPrecision = tf32 : tensor<8x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked15}>> * tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked15}>> -> tensor<8x32xf32, #blocked15> 2026-02-21T12:32:06.8295499Z %138 = ttg.convert_layout %137 : tensor<8x32xf32, #blocked15> -> tensor<8x32xf32, #blocked5> 2026-02-21T12:32:06.8295741Z %139 = tt.reshape %138 : tensor<8x32xf32, #blocked5> -> tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8295978Z %140 = arith.truncf %139 : tensor<1x8x32xf32, #blocked3> to tensor<1x8x32xbf16, #blocked3> 2026-02-21T12:32:06.8296218Z %141 = arith.extf %140 : tensor<1x8x32xbf16, #blocked3> to tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8296411Z %142 = "tt.reduce"(%141) <{axis = 2 : i32}> ({ 2026-02-21T12:32:06.8296539Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T12:32:06.8296663Z %208 = arith.maxnumf %arg8, %arg9 : f32 2026-02-21T12:32:06.8296790Z tt.reduce.return %208 : f32 2026-02-21T12:32:06.8296976Z }) : (tensor<1x8x32xf32, #blocked3>) -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> 2026-02-21T12:32:06.8297271Z %143 = ttg.convert_layout %142 : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8297543Z %144 = arith.truncf %143 : tensor<1x8xf32, #blocked4> to tensor<1x8xbf16, #blocked4> 2026-02-21T12:32:06.8297766Z %145 = arith.extf %144 : tensor<1x8xbf16, #blocked4> to 
tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8297960Z %146 = arith.mulf %145, %cst_9 : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8298150Z %147 = arith.truncf %146 : tensor<1x8xf32, #blocked4> to tensor<1x8xbf16, #blocked4> 2026-02-21T12:32:06.8298388Z %148 = arith.extf %147 : tensor<1x8xbf16, #blocked4> to tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8298583Z %149 = arith.cmpf ogt, %arg5, %148 : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8298757Z %150 = arith.cmpf une, %arg5, %arg5 : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8298920Z %151 = arith.ori %149, %150 : tensor<1x8xi1, #blocked4> 2026-02-21T12:32:06.8299118Z %152 = arith.select %151, %arg5, %148 : tensor<1x8xi1, #blocked4>, tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8299358Z %153 = arith.mulf %141, %cst_8 : tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8299563Z %154 = arith.truncf %153 : tensor<1x8x32xf32, #blocked3> to tensor<1x8x32xbf16, #blocked3> 2026-02-21T12:32:06.8299857Z %155 = ttg.convert_layout %152 : tensor<1x8xf32, #blocked4> -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T12:32:06.8300198Z %156 = tt.expand_dims %155 {axis = 2 : i32} : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xf32, #blocked9> 2026-02-21T12:32:06.8300531Z %157 = ttg.convert_layout %156 : tensor<1x8x1xf32, #blocked9> -> tensor<1x8x1xf32, #blocked> 2026-02-21T12:32:06.8300780Z %158 = arith.extf %154 : tensor<1x8x32xbf16, #blocked3> to tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8301014Z %159 = tt.broadcast %157 : tensor<1x8x1xf32, #blocked> -> tensor<1x8x32xf32, #blocked> 2026-02-21T12:32:06.8301261Z %160 = ttg.convert_layout %159 : tensor<1x8x32xf32, #blocked> -> tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8301474Z %161 = arith.subf %158, %160 : tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8301803Z %162 = tt.extern_elementwise %161 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x8x32xf32, 
#blocked3>) -> tensor<1x8x32xf32, #blocked3> 2026-02-21T12:32:06.8302098Z %163 = "tt.reduce"(%162) <{axis = 2 : i32}> ({ 2026-02-21T12:32:06.8302228Z ^bb0(%arg8: f32, %arg9: f32): 2026-02-21T12:32:06.8302358Z %208 = arith.addf %arg8, %arg9 : f32 2026-02-21T12:32:06.8302480Z tt.reduce.return %208 : f32 2026-02-21T12:32:06.8302674Z }) : (tensor<1x8x32xf32, #blocked3>) -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> 2026-02-21T12:32:06.8302967Z %164 = ttg.convert_layout %163 : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8303221Z %165 = arith.subf %arg5, %152 : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8303521Z %166 = tt.extern_elementwise %165 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x8xf32, #blocked4>) -> tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8303813Z %167 = arith.mulf %arg6, %166 : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8303981Z %168 = arith.addf %167, %164 : tensor<1x8xf32, #blocked4> 2026-02-21T12:32:06.8304227Z %169 = ttg.convert_layout %166 : tensor<1x8xf32, #blocked4> -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T12:32:06.8304577Z %170 = tt.expand_dims %169 {axis = 2 : i32} : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xf32, #blocked9> 2026-02-21T12:32:06.8304878Z %171 = ttg.convert_layout %170 : tensor<1x8x1xf32, #blocked9> -> tensor<1x8x1xf32, #blocked> 2026-02-21T12:32:06.8305125Z %172 = tt.broadcast %171 : tensor<1x8x1xf32, #blocked> -> tensor<1x8x128xf32, #blocked> 2026-02-21T12:32:06.8305375Z %173 = ttg.convert_layout %172 : tensor<1x8x128xf32, #blocked> -> tensor<1x8x128xf32, #blocked1> 2026-02-21T12:32:06.8305593Z %174 = arith.mulf %arg7, %173 : tensor<1x8x128xf32, #blocked1> 2026-02-21T12:32:06.8305746Z %175 = arith.extsi %arg4 : i32 to i64 2026-02-21T12:32:06.8305890Z %176 = tt.splat %175 : i64 -> tensor<32xi64, #blocked7> 2026-02-21T12:32:06.8306047Z %177 
= arith.addi %176, %51 : tensor<32xi64, #blocked7> 2026-02-21T12:32:06.8306309Z %178 = ttg.convert_layout %177 : tensor<32xi64, #blocked7> -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T12:32:06.8306638Z %179 = tt.expand_dims %178 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi64, #blocked8> 2026-02-21T12:32:06.8306935Z %180 = ttg.convert_layout %179 : tensor<1x32xi64, #blocked8> -> tensor<1x32xi64, #blocked5> 2026-02-21T12:32:06.8307229Z %181 = ttg.convert_layout %180 : tensor<1x32xi64, #blocked5> -> tensor<1x32xi64, #ttg.slice<{dim = 2, parent = #blocked16}>> 2026-02-21T12:32:06.8307592Z %182 = tt.expand_dims %181 {axis = 2 : i32} : tensor<1x32xi64, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x32x1xi64, #blocked16> 2026-02-21T12:32:06.8307907Z %183 = ttg.convert_layout %182 : tensor<1x32x1xi64, #blocked16> -> tensor<1x32x1xi64, #blocked2> 2026-02-21T12:32:06.8308124Z %184 = arith.muli %183, %cst_7 : tensor<1x32x1xi64, #blocked2> 2026-02-21T12:32:06.8308333Z %185 = tt.broadcast %184 : tensor<1x32x1xi64, #blocked2> -> tensor<1x32x128xi64, #blocked2> 2026-02-21T12:32:06.8308598Z %186 = ttg.convert_layout %185 : tensor<1x32x128xi64, #blocked2> -> tensor<1x32x128xi64, #blocked1> 2026-02-21T12:32:06.8308839Z %187 = arith.addi %186, %59 : tensor<1x32x128xi64, #blocked1> 2026-02-21T12:32:06.8309011Z %188 = arith.addi %50, %187 : tensor<1x32x128xi64, #blocked1> 2026-02-21T12:32:06.8309232Z %189 = tt.addptr %48, %188 : tensor<1x32x128x!tt.ptr, #blocked1>, tensor<1x32x128xi64, #blocked1> 2026-02-21T12:32:06.8309465Z %190 = arith.cmpi sge, %183, %cst_6 : tensor<1x32x1xi64, #blocked2> 2026-02-21T12:32:06.8309648Z %191 = arith.cmpi slt, %183, %cst_5 : tensor<1x32x1xi64, #blocked2> 2026-02-21T12:32:06.8309820Z %192 = arith.andi %190, %191 : tensor<1x32x1xi1, #blocked2> 2026-02-21T12:32:06.8310002Z %193 = arith.andi %63, %192 : tensor<1x32x1xi1, #blocked2> 2026-02-21T12:32:06.8310203Z %194 = 
tt.broadcast %193 : tensor<1x32x1xi1, #blocked2> -> tensor<1x32x128xi1, #blocked2> 2026-02-21T12:32:06.8310465Z %195 = ttg.convert_layout %194 : tensor<1x32x128xi1, #blocked2> -> tensor<1x32x128xi1, #blocked1> 2026-02-21T12:32:06.8310685Z %196 = arith.andi %195, %67 : tensor<1x32x128xi1, #blocked1> 2026-02-21T12:32:06.8310867Z %197 = tt.load %189, %196, %cst_2 : tensor<1x32x128x!tt.ptr, #blocked1> 2026-02-21T12:32:06.8311094Z %198 = arith.truncf %162 : tensor<1x8x32xf32, #blocked3> to tensor<1x8x32xbf16, #blocked3> 2026-02-21T12:32:06.8311337Z %199 = tt.reshape %174 : tensor<1x8x128xf32, #blocked1> -> tensor<8x128xf32, #blocked10> 2026-02-21T12:32:06.8311576Z %200 = tt.reshape %198 : tensor<1x8x32xbf16, #blocked3> -> tensor<8x32xbf16, #blocked5> 2026-02-21T12:32:06.8311814Z %201 = tt.reshape %197 : tensor<1x32x128xbf16, #blocked1> -> tensor<32x128xbf16, #blocked10> 2026-02-21T12:32:06.8312125Z %202 = ttg.convert_layout %200 : tensor<8x32xbf16, #blocked5> -> tensor<8x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked17}>> 2026-02-21T12:32:06.8312492Z %203 = ttg.convert_layout %201 : tensor<32x128xbf16, #blocked10> -> tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked17}>> 2026-02-21T12:32:06.8312805Z %204 = ttg.convert_layout %199 : tensor<8x128xf32, #blocked10> -> tensor<8x128xf32, #blocked17> 2026-02-21T12:32:06.8313228Z %205 = tt.dot %202, %203, %204, inputPrecision = tf32 : tensor<8x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked17}>> * tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked17}>> -> tensor<8x128xf32, #blocked17> 2026-02-21T12:32:06.8313647Z %206 = ttg.convert_layout %205 : tensor<8x128xf32, #blocked17> -> tensor<8x128xf32, #blocked10> 2026-02-21T12:32:06.8313896Z %207 = tt.reshape %206 : tensor<8x128xf32, #blocked10> -> tensor<1x8x128xf32, #blocked1> 2026-02-21T12:32:06.8314177Z scf.yield %152, %168, %207 : tensor<1x8xf32, #blocked4>, tensor<1x8xf32, #blocked4>, tensor<1x8x128xf32, #blocked1> 2026-02-21T12:32:06.8314401Z } 
2026-02-21T12:32:06.8314593Z %69 = ttg.convert_layout %68#1 : tensor<1x8xf32, #blocked4> -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T12:32:06.8314929Z %70 = tt.expand_dims %69 {axis = 2 : i32} : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xf32, #blocked9> 2026-02-21T12:32:06.8315224Z %71 = ttg.convert_layout %70 : tensor<1x8x1xf32, #blocked9> -> tensor<1x8x1xf32, #blocked> 2026-02-21T12:32:06.8315477Z %72 = tt.broadcast %71 : tensor<1x8x1xf32, #blocked> -> tensor<1x8x128xf32, #blocked> 2026-02-21T12:32:06.8315713Z %73 = ttg.convert_layout %72 : tensor<1x8x128xf32, #blocked> -> tensor<1x8x128xf32, #blocked1> 2026-02-21T12:32:06.8315927Z %74 = arith.divf %68#2, %73 : tensor<1x8x128xf32, #blocked1> 2026-02-21T12:32:06.8316137Z %75 = arith.truncf %74 : tensor<1x8x128xf32, #blocked1> to tensor<1x8x128xbf16, #blocked1> 2026-02-21T12:32:06.8316320Z %76 = arith.extsi %8 : i32 to i64 2026-02-21T12:32:06.8316445Z %77 = arith.extsi %9 : i32 to i64 2026-02-21T12:32:06.8316608Z %78 = tt.splat %arg3 : !tt.ptr -> tensor<1x8x128x!tt.ptr, #blocked1> 2026-02-21T12:32:06.8316784Z %79 = arith.muli %76, %c32768_i64 : i64 2026-02-21T12:32:06.8316940Z %80 = tt.splat %79 : i64 -> tensor<1x8x128xi64, #blocked1> 2026-02-21T12:32:06.8317097Z %81 = tt.splat %77 : i64 -> tensor<8xi64, #blocked7> 2026-02-21T12:32:06.8317274Z %82 = arith.extsi %10 : tensor<8xi32, #blocked7> to tensor<8xi64, #blocked7> 2026-02-21T12:32:06.8317453Z %83 = arith.addi %81, %82 : tensor<8xi64, #blocked7> 2026-02-21T12:32:06.8317683Z %84 = ttg.convert_layout %83 : tensor<8xi64, #blocked7> -> tensor<8xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T12:32:06.8318020Z %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<8xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x8xi64, #blocked8> 2026-02-21T12:32:06.8318306Z %86 = ttg.convert_layout %85 : tensor<1x8xi64, #blocked8> -> tensor<1x8xi64, #blocked4> 2026-02-21T12:32:06.8318588Z %87 = 
ttg.convert_layout %86 : tensor<1x8xi64, #blocked4> -> tensor<1x8xi64, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T12:32:06.8318917Z %88 = tt.expand_dims %87 {axis = 2 : i32} : tensor<1x8xi64, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xi64, #blocked9> 2026-02-21T12:32:06.8319213Z %89 = ttg.convert_layout %88 : tensor<1x8x1xi64, #blocked9> -> tensor<1x8x1xi64, #blocked> 2026-02-21T12:32:06.8319415Z %90 = arith.muli %89, %cst_1 : tensor<1x8x1xi64, #blocked> 2026-02-21T12:32:06.8319612Z %91 = tt.broadcast %90 : tensor<1x8x1xi64, #blocked> -> tensor<1x8x128xi64, #blocked> 2026-02-21T12:32:06.8319851Z %92 = ttg.convert_layout %91 : tensor<1x8x128xi64, #blocked> -> tensor<1x8x128xi64, #blocked1> 2026-02-21T12:32:06.8320084Z %93 = arith.extsi %13 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7> 2026-02-21T12:32:06.8320354Z %94 = ttg.convert_layout %93 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> 2026-02-21T12:32:06.8320680Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8> 2026-02-21T12:32:06.8320973Z %96 = ttg.convert_layout %95 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked10> 2026-02-21T12:32:06.8321266Z %97 = ttg.convert_layout %96 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> 2026-02-21T12:32:06.8321610Z %98 = tt.expand_dims %97 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11> 2026-02-21T12:32:06.8321915Z %99 = ttg.convert_layout %98 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8322162Z %100 = tt.broadcast %99 : tensor<1x1x128xi64, #blocked1> -> tensor<1x8x128xi64, #blocked1> 2026-02-21T12:32:06.8322367Z %101 = arith.addi %92, %100 : tensor<1x8x128xi64, #blocked1> 2026-02-21T12:32:06.8322555Z %102 = arith.addi %80, %101 : 
tensor<1x8x128xi64, #blocked1> 2026-02-21T12:32:06.8322807Z %103 = tt.addptr %78, %102 : tensor<1x8x128x!tt.ptr, #blocked1>, tensor<1x8x128xi64, #blocked1> 2026-02-21T12:32:06.8323013Z %104 = arith.cmpi sge, %76, %c0_i64 : i64 2026-02-21T12:32:06.8323146Z %105 = arith.cmpi slt, %76, %c192_i64 : i64 2026-02-21T12:32:06.8323276Z %106 = arith.andi %104, %105 : i1 2026-02-21T12:32:06.8323418Z %107 = arith.cmpi sge, %89, %cst_0 : tensor<1x8x1xi64, #blocked> 2026-02-21T12:32:06.8323619Z %108 = arith.cmpi slt, %89, %cst : tensor<1x8x1xi64, #blocked> 2026-02-21T12:32:06.8323790Z %109 = arith.andi %107, %108 : tensor<1x8x1xi1, #blocked> 2026-02-21T12:32:06.8323949Z %110 = tt.splat %106 : i1 -> tensor<1x8x1xi1, #blocked> 2026-02-21T12:32:06.8324110Z %111 = arith.andi %110, %109 : tensor<1x8x1xi1, #blocked> 2026-02-21T12:32:06.8324307Z %112 = tt.broadcast %111 : tensor<1x8x1xi1, #blocked> -> tensor<1x8x128xi1, #blocked> 2026-02-21T12:32:06.8324555Z %113 = ttg.convert_layout %112 : tensor<1x8x128xi1, #blocked> -> tensor<1x8x128xi1, #blocked1> 2026-02-21T12:32:06.8324779Z %114 = arith.cmpi sge, %99, %cst_4 : tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8324981Z %115 = arith.cmpi slt, %99, %cst_3 : tensor<1x1x128xi64, #blocked1> 2026-02-21T12:32:06.8325156Z %116 = arith.andi %114, %115 : tensor<1x1x128xi1, #blocked1> 2026-02-21T12:32:06.8325354Z %117 = tt.broadcast %116 : tensor<1x1x128xi1, #blocked1> -> tensor<1x8x128xi1, #blocked1> 2026-02-21T12:32:06.8325562Z %118 = arith.andi %113, %117 : tensor<1x8x128xi1, #blocked1> 2026-02-21T12:32:06.8325734Z tt.store %103, %75, %118 : tensor<1x8x128x!tt.ptr, #blocked1> 2026-02-21T12:32:06.8325883Z tt.return 2026-02-21T12:32:06.8325966Z } 2026-02-21T12:32:06.8326046Z } 2026-02-21T12:32:06.8326090Z 2026-02-21T12:32:06.8326148Z {-# 2026-02-21T12:32:06.8326232Z external_resources: { 2026-02-21T12:32:06.8326344Z mlir_reproducer: { 2026-02-21T12:32:06.8328576Z pipeline: "builtin.module(tritongpu-coalesce, 
tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T12:32:06.8330870Z disable_threading: false, 2026-02-21T12:32:06.8330984Z verify_each: true 2026-02-21T12:32:06.8331083Z } 2026-02-21T12:32:06.8331158Z } 2026-02-21T12:32:06.8331237Z #-} 2026-02-21T12:32:06.8331517Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T12:32:06.8332213Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton 
project.` 2026-02-21T12:32:06.8332793Z [29s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T12:32:06.8333536Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 8, 32], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T12:32:06.8334221Z Error: RuntimeError: PassManager::run failed 2026-02-21T12:32:06.8334393Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T12:32:07.2405671Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.4 configs/s 2026-02-21T12:32:07.2415371Z [29s] Adaptive compile timeout: 30s (90% percentile=10.2s, bounds=[30.0s, 30s]) 2026-02-21T12:32:07.3569493Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 9401.3 configs/s 2026-02-21T12:32:07.5917491Z [30s] Initial random population of 100, 5 starting points: 2026-02-21T12:32:07.5918108Z error=13 2026-02-21T12:32:07.5918270Z ok=87 2026-02-21T12:32:07.5918417Z min=0.0540 2026-02-21T12:32:07.5918575Z mid=0.3969 2026-02-21T12:32:07.5918726Z max=23.3332 2026-02-21T12:32:07.5918896Z best={'block_sizes': [1, 32, 16], 2026-02-21T12:32:07.5919201Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T12:32:07.5919490Z 'l2_groupings': [2], 2026-02-21T12:32:07.5919697Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:32:07.5919927Z 'loop_orders': [[1, 0]], 2026-02-21T12:32:07.5920130Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:32:07.5920325Z 'num_stages': 4, 2026-02-21T12:32:07.5920500Z 'num_warps': 2, 2026-02-21T12:32:07.5921550Z 'pid_type': 'flat', 2026-02-21T12:32:07.5921745Z 'range_flattens': [None, True], 
2026-02-21T12:32:07.5921974Z 'range_multi_buffers': [None, None], 2026-02-21T12:32:07.5922196Z 'range_num_stages': [0, 1], 2026-02-21T12:32:07.5922403Z 'range_unroll_factors': [0, 1], 2026-02-21T12:32:07.5922678Z 'range_warp_specializes': [], 2026-02-21T12:32:07.5922888Z 'waves_per_eu': 4} 2026-02-21T12:32:07.5942905Z [30s] Fitting surrogate: 100 points, 100 targets 2026-02-21T12:32:08.3848108Z [30s] Generation 1 starting: 82 neighbors, 5 active search path(s) 2026-02-21T12:32:41.5971881Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:32:41.7833436Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:32:41.8929811Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:32:42.0007970Z [64s] Timeout after 30s compiling 
Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:32:42.3106355Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:32:42.9128525Z [65s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:32:43.2531468Z [65s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:32:43.2556787Z Generation 1: precompiling 
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.7 configs/s 2026-02-21T12:32:48.3860170Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 16.1 configs/s 2026-02-21T12:32:50.9430541Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.0 2026-02-21T12:32:50.9433037Z configs/s 2026-02-21T12:32:51.5011359Z [73s] Generation 1 complete: 2026-02-21T12:32:51.5011508Z error=1 2026-02-21T12:32:51.5011600Z timeout=7 2026-02-21T12:32:51.5011681Z ok=79 2026-02-21T12:32:51.5011763Z min=0.0505 2026-02-21T12:32:51.5011866Z mid=0.0945 2026-02-21T12:32:51.5011952Z max=0.8442 2026-02-21T12:32:51.5012044Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:32:51.5012209Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:32:51.5012369Z 'l2_groupings': [16], 2026-02-21T12:32:51.5012483Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:32:51.5012606Z 'loop_orders': [[0, 1]], 2026-02-21T12:32:51.5012725Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:32:51.5012833Z 'num_stages': 4, 2026-02-21T12:32:51.5012932Z 'num_warps': 4, 2026-02-21T12:32:51.5013024Z 'pid_type': 'xyz', 2026-02-21T12:32:51.5013124Z 'range_flattens': [None, None], 2026-02-21T12:32:51.5013248Z 'range_multi_buffers': [None, None], 2026-02-21T12:32:51.5013374Z 'range_num_stages': [0, 2], 2026-02-21T12:32:51.5013483Z 'range_unroll_factors': [0, 3], 2026-02-21T12:32:51.5013595Z 'range_warp_specializes': [], 2026-02-21T12:32:51.5013705Z 'waves_per_eu': 4} 2026-02-21T12:32:51.5450158Z [74s] Fitting surrogate: 187 points, 187 targets 2026-02-21T12:32:52.3953440Z [74s] Generation 2 starting: 69 neighbors, 5 active search path(s) 2026-02-21T12:33:25.7801845Z [108s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_blocked', 
range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:33:26.1596613Z [108s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:33:26.1618121Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.3 configs/s 2026-02-21T12:33:30.1891356Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.3 configs/s 2026-02-21T12:33:34.7353264Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 242.6 2026-02-21T12:33:34.7353759Z configs/s 2026-02-21T12:33:35.4147659Z [117s] Generation 2 complete: 2026-02-21T12:33:35.4148059Z error=2 2026-02-21T12:33:35.4148280Z timeout=2 2026-02-21T12:33:35.4148490Z ok=70 2026-02-21T12:33:35.4148737Z min=0.0463 2026-02-21T12:33:35.4149518Z mid=0.0753 2026-02-21T12:33:35.4149721Z max=0.8389 2026-02-21T12:33:35.4149914Z best={'block_sizes': [1, 64, 16], 2026-02-21T12:33:35.4150075Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:33:35.4150240Z 'l2_groupings': [16], 2026-02-21T12:33:35.4150364Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:33:35.4150484Z 'loop_orders': [[0, 1]], 2026-02-21T12:33:35.4150597Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:33:35.4150700Z 'num_stages': 4, 2026-02-21T12:33:35.4150792Z 'num_warps': 4, 2026-02-21T12:33:35.4150883Z 'pid_type': 'xyz', 2026-02-21T12:33:35.4150986Z 'range_flattens': [None, None], 2026-02-21T12:33:35.4151209Z 'range_multi_buffers': [None, None], 2026-02-21T12:33:35.4151329Z 
'range_num_stages': [0, 2], 2026-02-21T12:33:35.4151442Z 'range_unroll_factors': [0, 3], 2026-02-21T12:33:35.4151568Z 'range_warp_specializes': [], 2026-02-21T12:33:35.4151679Z 'waves_per_eu': 2} 2026-02-21T12:33:35.4200217Z [117s] Fitting surrogate: 261 points, 261 targets 2026-02-21T12:33:36.1281545Z [118s] Generation 3 starting: 66 neighbors, 4 active search path(s) 2026-02-21T12:34:10.7538596Z [153s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[2, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:34:11.1585657Z [153s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:34:11.1607737Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 0.6 configs/s 2026-02-21T12:34:15.7049285Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 14.9 configs/s 2026-02-21T12:34:18.0776291Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 403.1 2026-02-21T12:34:18.0779543Z configs/s 2026-02-21T12:34:18.5874984Z [161s] Generation 3 complete: 2026-02-21T12:34:18.5875379Z error=1 2026-02-21T12:34:18.5875594Z timeout=2 2026-02-21T12:34:18.5875839Z ok=67 2026-02-21T12:34:18.5876044Z min=0.0465 2026-02-21T12:34:18.5876263Z mid=0.0817 
2026-02-21T12:34:18.5876685Z max=3.5996 2026-02-21T12:34:18.5876919Z best={'block_sizes': [1, 64, 16], 2026-02-21T12:34:18.5877348Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:34:18.5877763Z 'l2_groupings': [16], 2026-02-21T12:34:18.5878068Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:34:18.5878381Z 'loop_orders': [[0, 1]], 2026-02-21T12:34:18.5878672Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:34:18.5878938Z 'num_stages': 4, 2026-02-21T12:34:18.5879182Z 'num_warps': 4, 2026-02-21T12:34:18.5879414Z 'pid_type': 'xyz', 2026-02-21T12:34:18.5879672Z 'range_flattens': [None, None], 2026-02-21T12:34:18.5879981Z 'range_multi_buffers': [None, None], 2026-02-21T12:34:18.5880292Z 'range_num_stages': [0, 2], 2026-02-21T12:34:18.5880572Z 'range_unroll_factors': [0, 2], 2026-02-21T12:34:18.5880866Z 'range_warp_specializes': [], 2026-02-21T12:34:18.5881152Z 'waves_per_eu': 2} 2026-02-21T12:34:18.6375022Z [161s] Fitting surrogate: 331 points, 331 targets 2026-02-21T12:34:19.1801934Z [161s] Generation 4 starting: 41 neighbors, 3 active search path(s) 2026-02-21T12:34:52.1214507Z [194s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:34:52.3775259Z [194s] Timeout after 30s compiling Config(block_sizes=[1, 64, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], 
range_multi_buffers=[None, True], range_num_stages=[2, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:34:52.3785142Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 0.4 configs/s 2026-02-21T12:34:54.8821172Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 17.1 configs/s 2026-02-21T12:34:56.6667477Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 529.1 2026-02-21T12:34:56.6670370Z configs/s 2026-02-21T12:34:57.1444426Z [199s] Generation 4 complete: 2026-02-21T12:34:57.1444793Z timeout=2 2026-02-21T12:34:57.1445016Z ok=42 2026-02-21T12:34:57.1445247Z min=0.0447 2026-02-21T12:34:57.1445451Z mid=0.0686 2026-02-21T12:34:57.1445658Z max=1.1353 2026-02-21T12:34:57.1445884Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:34:57.1446300Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:34:57.1446733Z 'l2_groupings': [16], 2026-02-21T12:34:57.1447014Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:34:57.1447348Z 'loop_orders': [[0, 1]], 2026-02-21T12:34:57.1447633Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:34:57.1447922Z 'num_stages': 3, 2026-02-21T12:34:57.1448155Z 'num_warps': 4, 2026-02-21T12:34:57.1448407Z 'pid_type': 'xyz', 2026-02-21T12:34:57.1448658Z 'range_flattens': [None, None], 2026-02-21T12:34:57.1448966Z 'range_multi_buffers': [None, None], 2026-02-21T12:34:57.1449280Z 'range_num_stages': [0, 2], 2026-02-21T12:34:57.1449565Z 'range_unroll_factors': [0, 2], 2026-02-21T12:34:57.1450200Z 'range_warp_specializes': [], 2026-02-21T12:34:57.1450410Z 'waves_per_eu': 2} 2026-02-21T12:34:57.1823106Z [199s] Fitting surrogate: 375 points, 375 targets 2026-02-21T12:34:58.2761352Z [200s] Generation 5 starting: 46 neighbors, 3 active search path(s) 2026-02-21T12:35:20.9022428Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 0.3 configs/s 2026-02-21T12:35:23.9701457Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 
47/47 16.1 configs/s 2026-02-21T12:35:26.3387567Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 403.1 2026-02-21T12:35:26.3391086Z configs/s 2026-02-21T12:35:26.8392102Z [229s] Generation 5 complete: 2026-02-21T12:35:26.8392449Z ok=49 2026-02-21T12:35:26.8392674Z min=0.0437 2026-02-21T12:35:26.8392901Z mid=0.0653 2026-02-21T12:35:26.8395589Z max=3.4227 2026-02-21T12:35:26.8395826Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:35:26.8396155Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:35:26.8396463Z 'l2_groupings': [16], 2026-02-21T12:35:26.8396667Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:35:26.8396901Z 'loop_orders': [[0, 1]], 2026-02-21T12:35:26.8397105Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:35:26.8397307Z 'num_stages': 3, 2026-02-21T12:35:26.8397474Z 'num_warps': 4, 2026-02-21T12:35:26.8397648Z 'pid_type': 'xyz', 2026-02-21T12:35:26.8397829Z 'range_flattens': [None, None], 2026-02-21T12:35:26.8398059Z 'range_multi_buffers': [None, None], 2026-02-21T12:35:26.8398285Z 'range_num_stages': [0, 2], 2026-02-21T12:35:26.8398485Z 'range_unroll_factors': [0, 2], 2026-02-21T12:35:26.8398994Z 'range_warp_specializes': [], 2026-02-21T12:35:26.8399193Z 'waves_per_eu': 3} 2026-02-21T12:35:26.8898436Z [229s] Fitting surrogate: 424 points, 424 targets 2026-02-21T12:35:27.3103188Z [229s] Generation 6 starting: 30 neighbors, 2 active search path(s) 2026-02-21T12:35:37.2704191Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.0 configs/s 2026-02-21T12:35:39.2762879Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.6 configs/s 2026-02-21T12:35:40.9987763Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 802.1 2026-02-21T12:35:40.9988396Z configs/s 2026-02-21T12:35:41.4292487Z [243s] Generation 6 complete: 2026-02-21T12:35:41.4292688Z ok=32 2026-02-21T12:35:41.4292827Z min=0.0432 2026-02-21T12:35:41.4292954Z mid=0.0710 2026-02-21T12:35:41.4293072Z 
max=0.6745 2026-02-21T12:35:41.4293211Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:35:41.4293471Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:35:41.4293697Z 'l2_groupings': [16], 2026-02-21T12:35:41.4293869Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:35:41.4294043Z 'loop_orders': [[0, 1]], 2026-02-21T12:35:41.4294210Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:35:41.4294390Z 'num_stages': 4, 2026-02-21T12:35:41.4294519Z 'num_warps': 4, 2026-02-21T12:35:41.4294657Z 'pid_type': 'xyz', 2026-02-21T12:35:41.4294801Z 'range_flattens': [None, None], 2026-02-21T12:35:41.4295010Z 'range_multi_buffers': [None, None], 2026-02-21T12:35:41.4295192Z 'range_num_stages': [0, 2], 2026-02-21T12:35:41.4295348Z 'range_unroll_factors': [0, 2], 2026-02-21T12:35:41.4295522Z 'range_warp_specializes': [], 2026-02-21T12:35:41.4295682Z 'waves_per_eu': 3} 2026-02-21T12:35:41.4553567Z [243s] Fitting surrogate: 456 points, 456 targets 2026-02-21T12:35:41.8622532Z [244s] Generation 7 starting: 31 neighbors, 2 active search path(s) 2026-02-21T12:36:06.8587069Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 0.2 configs/s 2026-02-21T12:36:08.9216747Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.7 configs/s 2026-02-21T12:36:10.2860096Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 679.9 2026-02-21T12:36:10.2863059Z configs/s 2026-02-21T12:36:10.7333242Z [273s] Generation 7 complete: 2026-02-21T12:36:10.7333638Z ok=33 2026-02-21T12:36:10.7333849Z min=0.0433 2026-02-21T12:36:10.7334057Z mid=0.0638 2026-02-21T12:36:10.7334268Z max=0.4157 2026-02-21T12:36:10.7334504Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:36:10.7334969Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:36:10.7335386Z 'l2_groupings': [16], 2026-02-21T12:36:10.7336111Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:36:10.7336430Z 'loop_orders': [[0, 1]], 2026-02-21T12:36:10.7336711Z 
'matrix_instr_nonkdim': 16, 2026-02-21T12:36:10.7336977Z 'num_stages': 3, 2026-02-21T12:36:10.7337352Z 'num_warps': 4, 2026-02-21T12:36:10.7337588Z 'pid_type': 'xyz', 2026-02-21T12:36:10.7337836Z 'range_flattens': [None, None], 2026-02-21T12:36:10.7338133Z 'range_multi_buffers': [None, None], 2026-02-21T12:36:10.7338443Z 'range_num_stages': [0, 1], 2026-02-21T12:36:10.7338724Z 'range_unroll_factors': [0, 2], 2026-02-21T12:36:10.7339039Z 'range_warp_specializes': [], 2026-02-21T12:36:10.7339312Z 'waves_per_eu': 3} 2026-02-21T12:36:10.7531808Z [273s] Fitting surrogate: 489 points, 489 targets 2026-02-21T12:36:11.1736711Z [273s] Generation 8 starting: 35 neighbors, 2 active search path(s) 2026-02-21T12:36:22.8443188Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 1.2 configs/s 2026-02-21T12:36:25.1157744Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 16.9 configs/s 2026-02-21T12:36:26.9059853Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 656.2 2026-02-21T12:36:26.9060446Z configs/s 2026-02-21T12:36:27.3418522Z [289s] Generation 8 complete: 2026-02-21T12:36:27.3418795Z ok=37 2026-02-21T12:36:27.3418965Z min=0.0438 2026-02-21T12:36:27.3419128Z mid=0.0653 2026-02-21T12:36:27.3419283Z max=0.8369 2026-02-21T12:36:27.3419487Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:36:27.3419816Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:36:27.3420137Z 'l2_groupings': [16], 2026-02-21T12:36:27.3420358Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:36:27.3420610Z 'loop_orders': [[0, 1]], 2026-02-21T12:36:27.3420823Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:36:27.3421438Z 'num_stages': 3, 2026-02-21T12:36:27.3421625Z 'num_warps': 4, 2026-02-21T12:36:27.3421808Z 'pid_type': 'xyz', 2026-02-21T12:36:27.3422017Z 'range_flattens': [None, True], 2026-02-21T12:36:27.3422257Z 'range_multi_buffers': [None, None], 2026-02-21T12:36:27.3422506Z 'range_num_stages': [0, 1], 
2026-02-21T12:36:27.3422720Z 'range_unroll_factors': [0, 2], 2026-02-21T12:36:27.3422959Z 'range_warp_specializes': [], 2026-02-21T12:36:27.3423179Z 'waves_per_eu': 3} 2026-02-21T12:36:27.3653435Z [289s] Fitting surrogate: 526 points, 526 targets 2026-02-21T12:36:27.7508810Z [290s] Generation 9 starting: 31 neighbors, 2 active search path(s) 2026-02-21T12:36:32.6301145Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 5.3 configs/s 2026-02-21T12:36:34.5580847Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.4 configs/s 2026-02-21T12:36:36.4893503Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.3 2026-02-21T12:36:36.4894117Z configs/s 2026-02-21T12:36:36.9425715Z [299s] Generation 9 complete: 2026-02-21T12:36:36.9426258Z ok=33 2026-02-21T12:36:36.9426417Z min=0.0436 2026-02-21T12:36:36.9426571Z mid=0.0476 2026-02-21T12:36:36.9426723Z max=0.3350 2026-02-21T12:36:36.9426895Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:36:36.9427223Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:36:36.9427534Z 'l2_groupings': [16], 2026-02-21T12:36:36.9427742Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:36:36.9427985Z 'loop_orders': [[0, 1]], 2026-02-21T12:36:36.9428308Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:36:36.9428510Z 'num_stages': 3, 2026-02-21T12:36:36.9428681Z 'num_warps': 4, 2026-02-21T12:36:36.9428855Z 'pid_type': 'xyz', 2026-02-21T12:36:36.9429041Z 'range_flattens': [None, True], 2026-02-21T12:36:36.9429272Z 'range_multi_buffers': [None, None], 2026-02-21T12:36:36.9429507Z 'range_num_stages': [0, 1], 2026-02-21T12:36:36.9429721Z 'range_unroll_factors': [0, 2], 2026-02-21T12:36:36.9429941Z 'range_warp_specializes': [], 2026-02-21T12:36:36.9430151Z 'waves_per_eu': 3} 2026-02-21T12:36:36.9782932Z [299s] Fitting surrogate: 559 points, 559 targets 2026-02-21T12:36:37.3037073Z [299s] Generation 10 starting: 25 neighbors, 2 active search path(s) 
2026-02-21T12:36:41.6288313Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 6.3 configs/s 2026-02-21T12:36:43.2781231Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 16.6 configs/s 2026-02-21T12:36:45.2908815Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 579.3 2026-02-21T12:36:45.2909455Z configs/s 2026-02-21T12:36:45.7188748Z [308s] Generation 10 complete: 2026-02-21T12:36:45.7189132Z ok=27 2026-02-21T12:36:45.7189337Z min=0.0434 2026-02-21T12:36:45.7189552Z mid=0.0466 2026-02-21T12:36:45.7189754Z max=0.2063 2026-02-21T12:36:45.7190441Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:36:45.7190864Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:36:45.7191307Z 'l2_groupings': [16], 2026-02-21T12:36:45.7191586Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:36:45.7191906Z 'loop_orders': [[0, 1]], 2026-02-21T12:36:45.7192189Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:36:45.7192469Z 'num_stages': 4, 2026-02-21T12:36:45.7192707Z 'num_warps': 4, 2026-02-21T12:36:45.7192975Z 'pid_type': 'xyz', 2026-02-21T12:36:45.7193231Z 'range_flattens': [None, True], 2026-02-21T12:36:45.7193537Z 'range_multi_buffers': [None, None], 2026-02-21T12:36:45.7193850Z 'range_num_stages': [0, 1], 2026-02-21T12:36:45.7194134Z 'range_unroll_factors': [0, 2], 2026-02-21T12:36:45.7194424Z 'range_warp_specializes': [], 2026-02-21T12:36:45.7194707Z 'waves_per_eu': 3} 2026-02-21T12:36:45.7435655Z [308s] Fitting surrogate: 586 points, 586 targets 2026-02-21T12:36:46.1290766Z [308s] Generation 11 starting: 28 neighbors, 2 active search path(s) 2026-02-21T12:36:50.2697082Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 9.3 configs/s 2026-02-21T12:36:52.0521750Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 28/28 16.5 configs/s 2026-02-21T12:36:53.4932562Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 641.9 2026-02-21T12:36:53.4933161Z configs/s 
2026-02-21T12:36:53.9340422Z [316s] Generation 11 complete: 2026-02-21T12:36:53.9340718Z ok=30 2026-02-21T12:36:53.9340907Z min=0.0443 2026-02-21T12:36:53.9341096Z mid=0.0542 2026-02-21T12:36:53.9341635Z max=0.4257 2026-02-21T12:36:53.9341841Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:36:53.9342219Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:36:53.9342591Z 'l2_groupings': [16], 2026-02-21T12:36:53.9342837Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:36:53.9343118Z 'loop_orders': [[0, 1]], 2026-02-21T12:36:53.9343377Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:36:53.9343621Z 'num_stages': 3, 2026-02-21T12:36:53.9343823Z 'num_warps': 4, 2026-02-21T12:36:53.9344198Z 'pid_type': 'xyz', 2026-02-21T12:36:53.9344446Z 'range_flattens': [None, True], 2026-02-21T12:36:53.9344705Z 'range_multi_buffers': [None, True], 2026-02-21T12:36:53.9344967Z 'range_num_stages': [0, 3], 2026-02-21T12:36:53.9345210Z 'range_unroll_factors': [0, 2], 2026-02-21T12:36:53.9345471Z 'range_warp_specializes': [], 2026-02-21T12:36:53.9345710Z 'waves_per_eu': 2} 2026-02-21T12:36:53.9566153Z [316s] Fitting surrogate: 616 points, 616 targets 2026-02-21T12:36:54.2131825Z [316s] Generation 12 starting: 15 neighbors, 1 active search path(s) 2026-02-21T12:36:56.5480057Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 14.1 configs/s 2026-02-21T12:36:57.5496716Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.5 configs/s 2026-02-21T12:36:58.1792729Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1398.1 2026-02-21T12:36:58.1793313Z configs/s 2026-02-21T12:36:58.5560320Z [321s] Generation 12 complete: 2026-02-21T12:36:58.5560698Z ok=16 2026-02-21T12:36:58.5560909Z min=0.0438 2026-02-21T12:36:58.5561132Z mid=0.0487 2026-02-21T12:36:58.5561338Z max=0.1585 2026-02-21T12:36:58.5570008Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:36:58.5570435Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 
'block_ptr'], 2026-02-21T12:36:58.5570827Z 'l2_groupings': [16], 2026-02-21T12:36:58.5571099Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:36:58.5571418Z 'loop_orders': [[0, 1]], 2026-02-21T12:36:58.5571685Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:36:58.5571940Z 'num_stages': 3, 2026-02-21T12:36:58.5572164Z 'num_warps': 4, 2026-02-21T12:36:58.5572382Z 'pid_type': 'xyz', 2026-02-21T12:36:58.5572628Z 'range_flattens': [None, True], 2026-02-21T12:36:58.5572914Z 'range_multi_buffers': [None, True], 2026-02-21T12:36:58.5573384Z 'range_num_stages': [0, 2], 2026-02-21T12:36:58.5573650Z 'range_unroll_factors': [0, 2], 2026-02-21T12:36:58.5573854Z 'range_warp_specializes': [], 2026-02-21T12:36:58.5574036Z 'waves_per_eu': 2} 2026-02-21T12:36:58.5686275Z [321s] Fitting surrogate: 632 points, 632 targets 2026-02-21T12:36:58.7908394Z [321s] Generation 13 starting: 11 neighbors, 1 active search path(s) 2026-02-21T12:37:01.2159779Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 3.3 configs/s 2026-02-21T12:37:01.9770714Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.0 configs/s 2026-02-21T12:37:02.7652700Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1109.4 2026-02-21T12:37:02.7653264Z configs/s 2026-02-21T12:37:03.1459268Z [325s] Generation 13 complete: 2026-02-21T12:37:03.1459628Z ok=12 2026-02-21T12:37:03.1459839Z min=0.0480 2026-02-21T12:37:03.1460065Z mid=0.0486 2026-02-21T12:37:03.1460270Z max=0.0668 2026-02-21T12:37:03.1460533Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:37:03.1460959Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:03.1461403Z 'l2_groupings': [16], 2026-02-21T12:37:03.1461686Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:03.1462013Z 'loop_orders': [[0, 1]], 2026-02-21T12:37:03.1462313Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:03.1462607Z 'num_stages': 2, 2026-02-21T12:37:03.1462844Z 'num_warps': 4, 2026-02-21T12:37:03.1463077Z 
'pid_type': 'xyz', 2026-02-21T12:37:03.1463342Z 'range_flattens': [None, True], 2026-02-21T12:37:03.1463966Z 'range_multi_buffers': [None, True], 2026-02-21T12:37:03.1464287Z 'range_num_stages': [0, 2], 2026-02-21T12:37:03.1464568Z 'range_unroll_factors': [0, 2], 2026-02-21T12:37:03.1464870Z 'range_warp_specializes': [], 2026-02-21T12:37:03.1465153Z 'waves_per_eu': 3} 2026-02-21T12:37:03.1572424Z [325s] Fitting surrogate: 644 points, 644 targets 2026-02-21T12:37:03.4086969Z [325s] Generation 14 starting: 15 neighbors, 1 active search path(s) 2026-02-21T12:37:06.3661983Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 8.5 configs/s 2026-02-21T12:37:07.4297915Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.3 configs/s 2026-02-21T12:37:08.5782901Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1659.6 2026-02-21T12:37:08.5783409Z configs/s 2026-02-21T12:37:08.9876031Z [331s] Generation 14 complete: 2026-02-21T12:37:08.9876377Z ok=17 2026-02-21T12:37:08.9876596Z min=0.0438 2026-02-21T12:37:08.9876835Z mid=0.0558 2026-02-21T12:37:08.9877035Z max=0.2195 2026-02-21T12:37:08.9877270Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:37:08.9877692Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:08.9878109Z 'l2_groupings': [16], 2026-02-21T12:37:08.9878390Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:08.9878732Z 'loop_orders': [[0, 1]], 2026-02-21T12:37:08.9879014Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:08.9879291Z 'num_stages': 2, 2026-02-21T12:37:08.9879537Z 'num_warps': 4, 2026-02-21T12:37:08.9879772Z 'pid_type': 'xyz', 2026-02-21T12:37:08.9880033Z 'range_flattens': [None, True], 2026-02-21T12:37:08.9880335Z 'range_multi_buffers': [None, True], 2026-02-21T12:37:08.9880882Z 'range_num_stages': [0, 2], 2026-02-21T12:37:08.9881165Z 'range_unroll_factors': [0, 2], 2026-02-21T12:37:08.9881467Z 'range_warp_specializes': [], 2026-02-21T12:37:08.9881747Z 'waves_per_eu': 
3} 2026-02-21T12:37:09.0028488Z [331s] Fitting surrogate: 661 points, 661 targets 2026-02-21T12:37:09.3025989Z [331s] Generation 15 starting: 16 neighbors, 1 active search path(s) 2026-02-21T12:37:12.7548887Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.1 configs/s 2026-02-21T12:37:13.8366006Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.1 configs/s 2026-02-21T12:37:14.5471161Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1237.9 2026-02-21T12:37:14.5471723Z configs/s 2026-02-21T12:37:14.9477852Z [337s] Generation 15 complete: 2026-02-21T12:37:14.9478166Z ok=18 2026-02-21T12:37:14.9478383Z min=0.0473 2026-02-21T12:37:14.9478589Z mid=0.0673 2026-02-21T12:37:14.9478803Z max=0.2220 2026-02-21T12:37:14.9479027Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:37:14.9479430Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:14.9479823Z 'l2_groupings': [16], 2026-02-21T12:37:14.9480111Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:14.9480416Z 'loop_orders': [[0, 1]], 2026-02-21T12:37:14.9480691Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:14.9480949Z 'num_stages': 2, 2026-02-21T12:37:14.9481176Z 'num_warps': 4, 2026-02-21T12:37:14.9481398Z 'pid_type': 'xyz', 2026-02-21T12:37:14.9481647Z 'range_flattens': [None, True], 2026-02-21T12:37:14.9481948Z 'range_multi_buffers': [None, None], 2026-02-21T12:37:14.9482247Z 'range_num_stages': [0, 2], 2026-02-21T12:37:14.9482514Z 'range_unroll_factors': [0, 2], 2026-02-21T12:37:14.9482916Z 'range_warp_specializes': [], 2026-02-21T12:37:14.9483187Z 'waves_per_eu': 3} 2026-02-21T12:37:14.9646095Z [337s] Fitting surrogate: 679 points, 679 targets 2026-02-21T12:37:15.2523969Z [337s] Generation 16 starting: 16 neighbors, 1 active search path(s) 2026-02-21T12:37:18.5171288Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 6.5 configs/s 2026-02-21T12:37:19.6633611Z Generation 16: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━ 17/17 17.0 configs/s 2026-02-21T12:37:20.3714995Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1247.0 2026-02-21T12:37:20.3715431Z configs/s 2026-02-21T12:37:20.7323470Z [343s] Generation 16 complete: 2026-02-21T12:37:20.7323819Z ok=18 2026-02-21T12:37:20.7324027Z min=0.0441 2026-02-21T12:37:20.7324261Z mid=0.0584 2026-02-21T12:37:20.7324466Z max=0.1888 2026-02-21T12:37:20.7324701Z best={'block_sizes': [1, 64, 32], 2026-02-21T12:37:20.7325379Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:20.7325786Z 'l2_groupings': [16], 2026-02-21T12:37:20.7326068Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:20.7326509Z 'loop_orders': [[0, 1]], 2026-02-21T12:37:20.7326794Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:20.7327060Z 'num_stages': 1, 2026-02-21T12:37:20.7327293Z 'num_warps': 4, 2026-02-21T12:37:20.7327522Z 'pid_type': 'xyz', 2026-02-21T12:37:20.7327788Z 'range_flattens': [None, True], 2026-02-21T12:37:20.7328090Z 'range_multi_buffers': [None, True], 2026-02-21T12:37:20.7328398Z 'range_num_stages': [0, 2], 2026-02-21T12:37:20.7328678Z 'range_unroll_factors': [0, 2], 2026-02-21T12:37:20.7328970Z 'range_warp_specializes': [], 2026-02-21T12:37:20.7329251Z 'waves_per_eu': 3} 2026-02-21T12:37:20.7488538Z [343s] Fitting surrogate: 697 points, 697 targets 2026-02-21T12:37:21.0151439Z [343s] Generation 17 starting: 15 neighbors, 1 active search path(s) 2026-02-21T12:37:24.1297676Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 7.8 configs/s 2026-02-21T12:37:25.1042880Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 16.1 configs/s 2026-02-21T12:37:25.9279155Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1080.0 2026-02-21T12:37:25.9279741Z configs/s 2026-02-21T12:37:26.3143396Z [348s] Generation 17 complete: 2026-02-21T12:37:26.3143767Z ok=16 2026-02-21T12:37:26.3143980Z min=0.0467 2026-02-21T12:37:26.3144181Z mid=0.0542 
2026-02-21T12:37:26.3144378Z max=0.2231 2026-02-21T12:37:26.3144605Z best={'block_sizes': [1, 64, 64], 2026-02-21T12:37:26.3145021Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:26.3145427Z 'l2_groupings': [16], 2026-02-21T12:37:26.3145702Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:26.3146378Z 'loop_orders': [[0, 1]], 2026-02-21T12:37:26.3146658Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:26.3146949Z 'num_stages': 1, 2026-02-21T12:37:26.3147175Z 'num_warps': 4, 2026-02-21T12:37:26.3147403Z 'pid_type': 'xyz', 2026-02-21T12:37:26.3147649Z 'range_flattens': [None, True], 2026-02-21T12:37:26.3147958Z 'range_multi_buffers': [None, True], 2026-02-21T12:37:26.3148259Z 'range_num_stages': [0, 2], 2026-02-21T12:37:26.3148533Z 'range_unroll_factors': [0, 3], 2026-02-21T12:37:26.3148821Z 'range_warp_specializes': [], 2026-02-21T12:37:26.3149103Z 'waves_per_eu': 3} 2026-02-21T12:37:26.3329529Z [348s] Fitting surrogate: 713 points, 713 targets 2026-02-21T12:37:26.6240678Z [349s] Generation 18 starting: 17 neighbors, 1 active search path(s) 2026-02-21T12:37:31.5150334Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 2.8 configs/s 2026-02-21T12:37:32.7275487Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 16.8 configs/s 2026-02-21T12:37:33.6381078Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3288.8 2026-02-21T12:37:33.6382006Z configs/s 2026-02-21T12:37:33.9911328Z [356s] Generation 18 complete: 2026-02-21T12:37:33.9911544Z ok=19 2026-02-21T12:37:33.9911674Z min=0.0460 2026-02-21T12:37:33.9911815Z mid=0.0804 2026-02-21T12:37:33.9911944Z max=0.3543 2026-02-21T12:37:33.9912081Z best={'block_sizes': [1, 64, 64], 2026-02-21T12:37:33.9912329Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:33.9912580Z 'l2_groupings': [16], 2026-02-21T12:37:33.9912948Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:33.9913143Z 'loop_orders': [[0, 1]], 
2026-02-21T12:37:33.9913312Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:33.9913475Z 'num_stages': 1, 2026-02-21T12:37:33.9913608Z 'num_warps': 4, 2026-02-21T12:37:33.9913750Z 'pid_type': 'xyz', 2026-02-21T12:37:33.9913901Z 'range_flattens': [None, True], 2026-02-21T12:37:33.9914089Z 'range_multi_buffers': [None, True], 2026-02-21T12:37:33.9914270Z 'range_num_stages': [0, 2], 2026-02-21T12:37:33.9914444Z 'range_unroll_factors': [0, 3], 2026-02-21T12:37:33.9914622Z 'range_warp_specializes': [], 2026-02-21T12:37:33.9914781Z 'waves_per_eu': 3} 2026-02-21T12:37:34.0026208Z [356s] Fitting surrogate: 732 points, 732 targets 2026-02-21T12:37:34.3109759Z [356s] Generation 19 starting: 18 neighbors, 1 active search path(s) 2026-02-21T12:37:38.8361936Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.8 configs/s 2026-02-21T12:37:40.0832550Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.2 configs/s 2026-02-21T12:37:40.2086129Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 7963.7 2026-02-21T12:37:40.2086533Z configs/s 2026-02-21T12:37:40.4189982Z [362s] Generation 19 complete: 2026-02-21T12:37:40.4190345Z ok=20 2026-02-21T12:37:40.4190944Z min=0.0462 2026-02-21T12:37:40.4191160Z mid=0.1308 2026-02-21T12:37:40.4191364Z max=0.3769 2026-02-21T12:37:40.4191618Z best={'block_sizes': [1, 64, 64], 2026-02-21T12:37:40.4192046Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:40.4192451Z 'l2_groupings': [16], 2026-02-21T12:37:40.4192754Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:40.4193075Z 'loop_orders': [[0, 1]], 2026-02-21T12:37:40.4193352Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:40.4193625Z 'num_stages': 1, 2026-02-21T12:37:40.4193860Z 'num_warps': 4, 2026-02-21T12:37:40.4194104Z 'pid_type': 'xyz', 2026-02-21T12:37:40.4194353Z 'range_flattens': [None, True], 2026-02-21T12:37:40.4194658Z 'range_multi_buffers': [None, True], 2026-02-21T12:37:40.4194964Z 
'range_num_stages': [0, 2], 2026-02-21T12:37:40.4195243Z 'range_unroll_factors': [0, 3], 2026-02-21T12:37:40.4195534Z 'range_warp_specializes': [], 2026-02-21T12:37:40.4195812Z 'waves_per_eu': 3} 2026-02-21T12:37:40.4253834Z [362s] Fitting surrogate: 752 points, 752 targets 2026-02-21T12:37:40.6766867Z [363s] Generation 20 starting: 15 neighbors, 1 active search path(s) 2026-02-21T12:37:43.8756475Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 10.0 configs/s 2026-02-21T12:37:44.9477149Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s 2026-02-21T12:37:45.1065983Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 6564.9 2026-02-21T12:37:45.1066534Z configs/s 2026-02-21T12:37:45.3630271Z [367s] Generation 20 complete: 2026-02-21T12:37:45.3630619Z ok=17 2026-02-21T12:37:45.3630821Z min=0.0417 2026-02-21T12:37:45.3631034Z mid=0.1098 2026-02-21T12:37:45.3631235Z max=0.2755 2026-02-21T12:37:45.3631459Z best={'block_sizes': [1, 64, 64], 2026-02-21T12:37:45.3631868Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T12:37:45.3632274Z 'l2_groupings': [16], 2026-02-21T12:37:45.3632568Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:37:45.3632884Z 'loop_orders': [[0, 1]], 2026-02-21T12:37:45.3633414Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:37:45.3633687Z 'num_stages': 1, 2026-02-21T12:37:45.3633915Z 'num_warps': 4, 2026-02-21T12:37:45.3634138Z 'pid_type': 'xyz', 2026-02-21T12:37:45.3634402Z 'range_flattens': [None, True], 2026-02-21T12:37:45.3634696Z 'range_multi_buffers': [None, True], 2026-02-21T12:37:45.3635003Z 'range_num_stages': [0, 2], 2026-02-21T12:37:45.3635277Z 'range_unroll_factors': [0, 3], 2026-02-21T12:37:45.3635565Z 'range_warp_specializes': [], 2026-02-21T12:37:45.3635847Z 'waves_per_eu': 3} 2026-02-21T12:37:45.3721051Z [367s] Fitting surrogate: 769 points, 769 targets 2026-02-21T12:37:45.5029307Z [367s] Autotuning complete in 368.0s after searching 720 configs. 
2026-02-21T12:37:45.5029833Z One can hardcode the best config and skip autotuning with: 2026-02-21T12:37:45.5031896Z @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T12:37:45.5033680Z 2026-02-21T12:37:45.5034137Z [367s] Code of selected kernel: /tmp/torchinductor_root/sc/csc5gevho6toucdyyzbxtu6mbk3qvwwzcnzxldinj7cxnesrc5vi.py 2026-02-21T12:37:45.5274849Z from __future__ import annotations 2026-02-21T12:37:45.5275022Z 2026-02-21T12:37:45.5275082Z import torch 2026-02-21T12:37:45.5275209Z import triton 2026-02-21T12:37:45.5275354Z import triton.language as tl 2026-02-21T12:37:45.5275553Z from torch._inductor.runtime import triton_helpers 2026-02-21T12:37:45.5275816Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T12:37:45.5276285Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T12:37:45.5276465Z 2026-02-21T12:37:45.5276532Z _BLOCK_SIZE_1 = tl.constexpr(64) 2026-02-21T12:37:45.5276704Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T12:37:45.5276867Z _BLOCK_SIZE_3 = tl.constexpr(64) 2026-02-21T12:37:45.5277032Z _SHAPE_DIM_3 = tl.constexpr(128) 2026-02-21T12:37:45.5277193Z _SHAPE_DIM_7 = tl.constexpr(128) 2026-02-21T12:37:45.5277299Z 2026-02-21T12:37:45.5277349Z @triton.jit 2026-02-21T12:37:45.5277560Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr): 2026-02-21T12:37:45.5277910Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T12:37:45.5278172Z num_blocks_0 = 192 2026-02-21T12:37:45.5278314Z num_pid_m = 192 2026-02-21T12:37:45.5278464Z num_pid_n = 
tl.cdiv(256, _BLOCK_SIZE_1) 2026-02-21T12:37:45.5278706Z inner_2d_pid = tl.program_id(0) + tl.program_id(1) * num_blocks_0 2026-02-21T12:37:45.5278946Z num_pid_in_group = 16 * num_pid_n 2026-02-21T12:37:45.5279140Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T12:37:45.5279328Z first_pid_m = group_id * 16 2026-02-21T12:37:45.5279513Z group_size_m = min(num_pid_m - first_pid_m, 16) 2026-02-21T12:37:45.5279770Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T12:37:45.5280044Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T12:37:45.5280246Z offset_0 = pid_0 2026-02-21T12:37:45.5280410Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T12:37:45.5280604Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T12:37:45.5280868Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T12:37:45.5281119Z indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32) 2026-02-21T12:37:45.5281420Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T12:37:45.5281759Z m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32) 2026-02-21T12:37:45.5282035Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T12:37:45.5282288Z l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32) 2026-02-21T12:37:45.5282691Z # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32) 2026-02-21T12:37:45.5283011Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32) 2026-02-21T12:37:45.5283266Z # src[attention.py:71]: q = q_view[tile_b, tile_m, :] 2026-02-21T12:37:45.5283630Z q = tl.load(q_view + (indices_0[:, None, None] * 32768 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T12:37:45.5284017Z # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)): 2026-02-21T12:37:45.5284274Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T12:37:45.5284505Z # 
src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T12:37:45.5284702Z # src[attention.py:72-85]: ... 2026-02-21T12:37:45.5285066Z for offset_2 in tl.range(0, 256, _BLOCK_SIZE_3, loop_unroll_factor=3, num_stages=1, disallow_acc_multi_buffer=False, flatten=True): 2026-02-21T12:37:45.5285486Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32) 2026-02-21T12:37:45.5285705Z q_copy = q 2026-02-21T12:37:45.5285837Z m_i_copy = m_i 2026-02-21T12:37:45.5286004Z l_i_copy = l_i 2026-02-21T12:37:45.5286141Z acc_copy = acc 2026-02-21T12:37:45.5286277Z q_copy_0 = q_copy 2026-02-21T12:37:45.5286431Z m_i_copy_0 = m_i_copy 2026-02-21T12:37:45.5286585Z l_i_copy_0 = l_i_copy 2026-02-21T12:37:45.5286739Z acc_copy_0 = acc_copy 2026-02-21T12:37:45.5286925Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T12:37:45.5287451Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 256], [32768, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T12:37:45.5287980Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T12:37:45.5288624Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T12:37:45.5289292Z # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale) 2026-02-21T12:37:45.5289507Z amax = tl.cast(tl.max(qk, 2), tl.bfloat16) 2026-02-21T12:37:45.5289648Z v_0 = 0.12751743074602467 2026-02-21T12:37:45.5289783Z v_1 = tl.cast(amax * v_0, tl.bfloat16) 2026-02-21T12:37:45.5289922Z v_2 = tl.cast(v_1, tl.float32) 2026-02-21T12:37:45.5290073Z v_3 = triton_helpers.maximum(m_i_copy_0, v_2) 2026-02-21T12:37:45.5290256Z # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None] 2026-02-21T12:37:45.5290420Z v_4 = 0.12751743074602467 
2026-02-21T12:37:45.5290551Z v_5 = tl.cast(qk * v_4, tl.bfloat16) 2026-02-21T12:37:45.5290688Z subscript = v_3[:, :, None] 2026-02-21T12:37:45.5290821Z v_6 = tl.cast(v_5, tl.float32) 2026-02-21T12:37:45.5290977Z v_7 = v_6 - subscript 2026-02-21T12:37:45.5291111Z # src[attention.py:77]: p = torch.exp2(qk) 2026-02-21T12:37:45.5291260Z v_8 = libdevice.exp2(v_7) 2026-02-21T12:37:45.5291406Z # src[attention.py:78]: l_ij = torch.sum(p, -1) 2026-02-21T12:37:45.5291567Z l_ij = tl.cast(tl.sum(v_8, 2), tl.float32) 2026-02-21T12:37:45.5291736Z # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij) 2026-02-21T12:37:45.5291924Z v_9 = m_i_copy_0 - v_3 2026-02-21T12:37:45.5292046Z v_10 = libdevice.exp2(v_9) 2026-02-21T12:37:45.5292193Z # src[attention.py:80]: l_i = l_i * alpha + l_ij 2026-02-21T12:37:45.5292343Z v_11 = l_i_copy_0 * v_10 2026-02-21T12:37:45.5292462Z l_i = v_11 + l_ij 2026-02-21T12:37:45.5300110Z # src[attention.py:81]: acc = acc * alpha[:, :, None] 2026-02-21T12:37:45.5300270Z subscript_1 = v_10[:, :, None] 2026-02-21T12:37:45.5300390Z v_13 = acc_copy_0 * subscript_1 2026-02-21T12:37:45.5300569Z # src[attention.py:82]: v = v_view[tile_b, tile_n, :] 2026-02-21T12:37:45.5300818Z v = tl.load(v_view + (indices_0[:, None, None] * 32768 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T12:37:45.5301055Z # src[attention.py:83]: p = p.to(v.dtype) 2026-02-21T12:37:45.5301181Z v_14 = tl.cast(v_8, tl.bfloat16) 2026-02-21T12:37:45.5301319Z # src[attention.py:84]: acc = torch.baddbmm(acc, p, v) 2026-02-21T12:37:45.5301771Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T12:37:45.5302199Z # src[attention.py:85]: m_i = m_ij 2026-02-21T12:37:45.5302310Z m_i = v_3 2026-02-21T12:37:45.5302415Z # 
src[attention.py:87]: acc = acc / l_i[:, :, None] 2026-02-21T12:37:45.5302549Z subscript_2 = l_i[:, :, None] 2026-02-21T12:37:45.5302655Z v_15 = acc / subscript_2 2026-02-21T12:37:45.5302824Z # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype) 2026-02-21T12:37:45.5302973Z v_16 = tl.cast(v_15, tl.bfloat16) 2026-02-21T12:37:45.5303252Z tl.store(tl.make_block_ptr(out, [192, 256, 128], [32768, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM_7], [2, 1, 0]), v_16, boundary_check=[0, 1, 2]) 2026-02-21T12:37:45.5303499Z 2026-02-21T12:37:45.5303632Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T12:37:45.5303829Z """ 2026-02-21T12:37:45.5303919Z Computes scaled dot-product attention. 2026-02-21T12:37:45.5304001Z 2026-02-21T12:37:45.5304129Z Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V 2026-02-21T12:37:45.5304280Z 2026-02-21T12:37:45.5304314Z Args: 2026-02-21T12:37:45.5304424Z q_in: Query tensor of shape [..., seq_len_q, head_dim] 2026-02-21T12:37:45.5304577Z k_in: Key tensor of shape [..., seq_len_k, head_dim] 2026-02-21T12:37:45.5304730Z v_in: Value tensor of shape [..., seq_len_k, head_dim] 2026-02-21T12:37:45.5304825Z 2026-02-21T12:37:45.5304865Z Returns: 2026-02-21T12:37:45.5304969Z Output tensor of shape [..., seq_len_q, head_dim] 2026-02-21T12:37:45.5305094Z """ 2026-02-21T12:37:45.5305184Z # src[attention.py:56]: m_dim = q_in.size(-2) 2026-02-21T12:37:45.5305308Z m_dim = q_in.size(-2) 2026-02-21T12:37:45.5305414Z # src[attention.py:57]: n_dim = k_in.size(-2) 2026-02-21T12:37:45.5305536Z n_dim = k_in.size(-2) 2026-02-21T12:37:45.5305650Z # src[attention.py:58]: assert n_dim == v_in.size(-2) 2026-02-21T12:37:45.5305785Z assert n_dim == v_in.size(-2) 2026-02-21T12:37:45.5305930Z # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1)) 2026-02-21T12:37:45.5306071Z head_dim = 128 2026-02-21T12:37:45.5306200Z 
# src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T12:37:45.5306371Z assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T12:37:45.5306539Z # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T12:37:45.5306696Z q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T12:37:45.5306851Z # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T12:37:45.5307021Z v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T12:37:45.5307195Z # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T12:37:45.5307392Z k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T12:37:45.5307553Z # src[attention.py:64]: out = torch.empty_like(q_view) 2026-02-21T12:37:45.5307691Z out = torch.empty_like(q_view) 2026-02-21T12:37:45.5307847Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T12:37:45.5308040Z _BLOCK_SIZE_1 = 64 2026-02-21T12:37:45.5308130Z _RDIM_SIZE_2 = 128 2026-02-21T12:37:45.5308273Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T12:37:45.5308500Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T12:37:45.5308704Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T12:37:45.5308848Z # src[attention.py:67-88]: ... 2026-02-21T12:37:45.5309144Z _launcher(_helion_attention, (192, triton.cdiv(256, _BLOCK_SIZE_1)), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=1, waves_per_eu=3, matrix_instr_nonkdim=16) 2026-02-21T12:37:45.5309461Z # src[attention.py:89]: return out.view(q_in.size()) 2026-02-21T12:37:45.5309594Z return out.view(q_in.size()) 2026-02-21T12:37:46.1794524Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T12:37:46.1796863Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T12:37:46.1798934Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T12:37:46.1799354Z WARNING:tritonbench.utils.triton_op:Completed input ID 1: 2026-02-21T12:37:46.1799719Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T12:37:46.1800014Z ------------------------------------------ 2026-02-21T12:37:46.1800280Z (4, 48, 256, 256, 128) 2026-02-21T12:37:46.1800483Z 2026-02-21T12:37:46.1802894Z 33%|███▎ | 2/6 [10:52<22:17, 334.27s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T12:37:46.1803383Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T12:37:46.1803684Z ------------------------------------------ 2026-02-21T12:37:46.1803953Z (4, 48, 512, 512, 128) 2026-02-21T12:37:46.1804967Z INFO:tritonbench.utils.triton_op:Took 0.08ms to get benchmark function for aten 2026-02-21T12:37:47.2958230Z INFO:tritonbench.utils.triton_op:Took 2.51ms to get benchmark function for flex_attention 2026-02-21T12:37:48.6761114Z WARNING:__main__:Input tensor metadata: 2026-02-21T12:37:48.6761559Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T12:37:48.6761886Z 'dtype': 'torch.bfloat16', 2026-02-21T12:37:48.6762210Z 'shape': (4, 48, 512, 128), 2026-02-21T12:37:48.6762535Z 'stride': (3145728, 65536, 128, 1)}, 2026-02-21T12:37:48.6762969Z { 'device': 'cuda:0', 2026-02-21T12:37:48.6763276Z 'dtype': 'torch.bfloat16', 2026-02-21T12:37:48.6763579Z 'shape': (4, 48, 512, 128), 
2026-02-21T12:37:48.6763905Z 'stride': (3145728, 65536, 128, 1)}, 2026-02-21T12:37:48.6764224Z { 'device': 'cuda:0', 2026-02-21T12:37:48.6764512Z 'dtype': 'torch.bfloat16', 2026-02-21T12:37:48.6764811Z 'shape': (4, 48, 512, 128), 2026-02-21T12:37:48.6765127Z 'stride': (3145728, 65536, 128, 1)}), 2026-02-21T12:37:48.6765437Z 'kwargs': {}} 2026-02-21T12:37:48.6795468Z INFO:tritonbench.utils.triton_op:Took 3.84ms to get benchmark function for helion_attention 2026-02-21T12:37:48.9337947Z [0s] Autotune random seed: 2150287535 2026-02-21T12:37:48.9641749Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T12:38:23.2682115Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:38:28.8678999Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[2, 1], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:38:31.2610902Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 4, 512], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=1, 
pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[0, 3], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:38:31.3795442Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:38:32.4741520Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 8, 512], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:38:32.6952167Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:38:32.6978204Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T12:38:38.1819582Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:93:133: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T12:38:38.1820795Z v = tl.load(v_view + (indices_0[:, None, None] * 65536 + 
indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T12:38:38.1821532Z ^ 2026-02-21T12:38:38.1823489Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:97:144: note: - use: %121 = "tt.reshape"(<>) : (tensor<1x256x128xbf16, #ttg.blocked<{sizePerThread = [1, 1, 8], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 2, 1], order = [2, 0, 1]}>>) -> tensor<256x128xbf16, #ttg.linear<{register = [[0, 1], [0, 2], [0, 4], [8, 0], [16, 0], [32, 0], [64, 0], [128, 0]], lane = [[0, 8], [0, 16], [0, 32], [0, 64], [1, 0], [2, 0]], warp = [[4, 0]], block = []}>> 2026-02-21T12:38:38.1825445Z 2026-02-21T12:38:38.1826345Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T12:38:38.1827661Z ^ 2026-02-21T12:38:38.1828126Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T12:38:38.1854412Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T12:38:38.1854989Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T12:38:38.1855514Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T12:38:38.1856005Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T12:38:38.1856481Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T12:38:38.1856947Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T12:38:38.1857557Z #blocked6 = 
#ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [0, 1, 2]}> 2026-02-21T12:38:38.1858070Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [0, 1, 2]}> 2026-02-21T12:38:38.1858583Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:38:38.1859099Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:38:38.1859749Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T12:38:38.1860607Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T12:38:38.1861268Z %c128_i32 = arith.constant 128 : i32 2026-02-21T12:38:38.1861474Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T12:38:38.1861675Z %c256_i32 = arith.constant 256 : i32 2026-02-21T12:38:38.1861859Z %c512_i32 = arith.constant 512 : i32 2026-02-21T12:38:38.1862049Z %c0_i32 = arith.constant 0 : i32 2026-02-21T12:38:38.1862241Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T12:38:38.1862429Z %c1_i32 = arith.constant 1 : i32 2026-02-21T12:38:38.1862672Z %cst = arith.constant dense<128> : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1862993Z %cst_0 = arith.constant dense<0.127517432> : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1863325Z %cst_1 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1863645Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1863959Z %cst_3 = arith.constant dense<128> : tensor<1x1x256xi32, #blocked1> 
2026-02-21T12:38:38.1864275Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1864598Z %cst_5 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1864911Z %cst_6 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1865200Z %c16_i32 = arith.constant 16 : i32 2026-02-21T12:38:38.1865388Z %c192_i32 = arith.constant 192 : i32 2026-02-21T12:38:38.1865591Z %c98304_i32 = arith.constant 98304 : i32 2026-02-21T12:38:38.1865757Z %c21_i32 = arith.constant 21 : i32 2026-02-21T12:38:38.1865920Z %0 = tt.get_program_id x : i32 2026-02-21T12:38:38.1866078Z %1 = arith.muli %0, %c21_i32 : i32 2026-02-21T12:38:38.1866234Z %2 = arith.addi %1, %c21_i32 : i32 2026-02-21T12:38:38.1866391Z %3 = arith.minsi %2, %c98304_i32 : i32 2026-02-21T12:38:38.1866652Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4> 2026-02-21T12:38:38.1867039Z %5 = ttg.convert_layout %4 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.1867503Z %6 = tt.expand_dims %5 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x128xi32, #blocked5> 2026-02-21T12:38:38.1867913Z %7 = ttg.convert_layout %6 : tensor<1x128xi32, #blocked5> -> tensor<1x128xi32, #blocked3> 2026-02-21T12:38:38.1868313Z %8 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.1868786Z %9 = tt.expand_dims %8 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x128xi32, #blocked6> 2026-02-21T12:38:38.1869210Z %10 = ttg.convert_layout %9 : tensor<1x1x128xi32, #blocked6> -> tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1869541Z %11 = tt.splat %arg0 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1869864Z %12 = tt.make_range {end = 256 : i32, start = 0 : i32} : 
tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1870247Z %13 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.1870722Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x128x1xi32, #blocked7> 2026-02-21T12:38:38.1871147Z %15 = ttg.convert_layout %14 : tensor<1x128x1xi32, #blocked7> -> tensor<1x128x1xi32, #blocked> 2026-02-21T12:38:38.1871474Z %16 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1871824Z %17 = tt.broadcast %10 : tensor<1x1x128xi32, #blocked1> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1872163Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1872456Z %19 = tt.splat %arg3 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1872683Z %20 = arith.subi %3, %1 : i32 2026-02-21T12:38:38.1872847Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T12:38:38.1873021Z %21 = arith.subi %c1_i32, %c1_i32_7 : i32 2026-02-21T12:38:38.1873187Z %22 = arith.addi %20, %21 : i32 2026-02-21T12:38:38.1873348Z %23 = arith.divui %22, %c1_i32 : i32 2026-02-21T12:38:38.1873510Z %c2_i32 = arith.constant 2 : i32 2026-02-21T12:38:38.1873655Z %24 = arith.remsi %23, %c2_i32 : i32 2026-02-21T12:38:38.1873777Z %25 = arith.subi %23, %24 : i32 2026-02-21T12:38:38.1873898Z %26 = arith.muli %25, %c1_i32 : i32 2026-02-21T12:38:38.1874020Z %27 = arith.addi %1, %26 : i32 2026-02-21T12:38:38.1874146Z %28 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T12:38:38.1874287Z scf.for %arg4 = %1 to %27 step %28 : i32 { 2026-02-21T12:38:38.1874429Z %29 = arith.divsi %arg4, %c8192_i32 : i32 2026-02-21T12:38:38.1874565Z %30 = arith.muli %29, %c16_i32 : i32 2026-02-21T12:38:38.1874692Z %31 = arith.subi %c192_i32, %30 : i32 2026-02-21T12:38:38.1874821Z %32 = arith.minsi %31, %c16_i32 : i32 2026-02-21T12:38:38.1874950Z %33 = arith.remsi %arg4, 
%c8192_i32 : i32 2026-02-21T12:38:38.1875080Z %34 = arith.remsi %33, %32 : i32 2026-02-21T12:38:38.1875204Z %35 = arith.addi %30, %34 : i32 2026-02-21T12:38:38.1875343Z %36 = arith.divsi %33, %32 : i32 2026-02-21T12:38:38.1875469Z %37 = arith.muli %35, %c65536_i32 : i32 2026-02-21T12:38:38.1875596Z %38 = arith.muli %36, %c128_i32 : i32 2026-02-21T12:38:38.1875723Z %39 = arith.addi %37, %38 : i32 2026-02-21T12:38:38.1875871Z %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1876050Z %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1876280Z %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1876537Z %43 = tt.load %42 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1876716Z %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked> 2026-02-21T12:38:38.1876886Z %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked> 2026-02-21T12:38:38.1877104Z %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x256xi32, #blocked> 2026-02-21T12:38:38.1877382Z %47 = ttg.convert_layout %46 : tensor<1x128x256xi32, #blocked> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1877662Z %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3> 2026-02-21T12:38:38.1877876Z %49 = tt.splat %37 : i32 -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1878026Z %c0_i32_8 = arith.constant 0 : i32 2026-02-21T12:38:38.1878162Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T12:38:38.1878540Z %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c1024_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T12:38:38.1878954Z %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1879127Z %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1879386Z %95 = 
ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.1879757Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.1880078Z %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.1880419Z %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.1880795Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.1881134Z %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1881375Z %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1881606Z %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1881841Z %103 = arith.addi %47, %102 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1882087Z %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1882333Z %105 = tt.load %104 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1882638Z %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.1882970Z %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1883333Z %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1883641Z %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1884051Z %110 = tt.dot %107, %108, 
%109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1884463Z %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1884705Z %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1884946Z %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1885156Z %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1885284Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1885405Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.1885532Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1885722Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1886016Z %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1886284Z %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1886504Z %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1886698Z %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1886905Z %119 = arith.truncf %118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1887127Z %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1887355Z %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1887523Z %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1887690Z %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.1887883Z %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, 
#blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1888092Z %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1888298Z %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1888601Z %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1888939Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1889235Z %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1889478Z %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1889719Z %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.1889967Z %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1890183Z %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1890489Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1890782Z %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1890908Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1891026Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.1891148Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1891333Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1891622Z %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1891881Z %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2> 
2026-02-21T12:38:38.1892165Z %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1892453Z %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1892611Z %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1892869Z %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1893203Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1893495Z %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1893738Z %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.1893983Z %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1894199Z %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1894450Z %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.1894793Z %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.1895115Z %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1895326Z %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1895489Z %151 = arith.addi %49, %150 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1895689Z %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.1895943Z %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> 
tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1896162Z %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1896394Z %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1896620Z %156 = tt.load %155 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1896828Z %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1897067Z %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1897301Z %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.1897540Z %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.1897842Z %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1898199Z %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1898500Z %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1898909Z %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1899308Z %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1899490Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T12:38:38.1899637Z %166 = arith.muli %c256_i32, %c1_i32_12 : i32 2026-02-21T12:38:38.1899763Z %167 = arith.addi %arg5, %166 : i32 2026-02-21T12:38:38.1899900Z %168 = tt.splat %167 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1900056Z %169 = arith.addi %168, %12 : 
tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1900297Z %170 = ttg.convert_layout %169 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.1900651Z %171 = tt.expand_dims %170 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.1900941Z %172 = ttg.convert_layout %171 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.1901230Z %173 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.1901570Z %174 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.1901878Z %175 = ttg.convert_layout %174 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1902093Z %176 = arith.muli %175, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1902297Z %177 = tt.broadcast %176 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1902510Z %178 = arith.addi %47, %177 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1902724Z %179 = tt.addptr %16, %178 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1902960Z %180 = tt.load %179 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1903167Z %181 = tt.reshape %180 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.1903463Z %182 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1903819Z %183 = ttg.convert_layout %181 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1911708Z %184 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> 
tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1912126Z %185 = tt.dot %182, %183, %184, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1912525Z %186 = tt.reshape %185 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1912764Z %187 = arith.truncf %186 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1913006Z %188 = arith.extf %187 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1913200Z %189 = "tt.reduce"(%188) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1913325Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1913447Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.1913571Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1913764Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1914054Z %190 = ttg.convert_layout %189 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1914328Z %191 = arith.truncf %190 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1914551Z %192 = arith.extf %191 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1914742Z %193 = arith.mulf %192, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1914931Z %194 = arith.truncf %193 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1915169Z %195 = arith.extf %194 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1915366Z %196 = arith.cmpf ogt, %124, %195 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1915533Z %197 = arith.cmpf une, %124, %124 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1915692Z %198 = arith.ori %196, %197 : tensor<1x1xi1, #blocked2> 
2026-02-21T12:38:38.1915883Z %199 = arith.select %198, %124, %195 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1916105Z %200 = arith.mulf %188, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1916312Z %201 = arith.truncf %200 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1916595Z %202 = ttg.convert_layout %199 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1916928Z %203 = tt.expand_dims %202 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1917225Z %204 = ttg.convert_layout %203 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1917469Z %205 = arith.extf %201 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1917707Z %206 = tt.broadcast %204 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.1917957Z %207 = ttg.convert_layout %206 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1918187Z %208 = arith.subf %205, %207 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1918490Z %209 = tt.extern_elementwise %208 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1918777Z %210 = "tt.reduce"(%209) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1918906Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1919023Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.1919145Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1919334Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1919638Z %211 = ttg.convert_layout %210 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 
2026-02-21T12:38:38.1919881Z %212 = arith.subf %124, %199 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1920165Z %213 = tt.extern_elementwise %212 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1920452Z %214 = arith.mulf %140, %213 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1920609Z %215 = arith.addf %214, %211 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1920848Z %216 = ttg.convert_layout %213 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1921182Z %217 = tt.expand_dims %216 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1921475Z %218 = ttg.convert_layout %217 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1921715Z %219 = tt.broadcast %218 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.1921961Z %220 = ttg.convert_layout %219 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1922172Z %221 = arith.mulf %165, %220 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1922420Z %222 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.1922815Z %223 = tt.expand_dims %222 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.1923120Z %224 = ttg.convert_layout %223 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1923331Z %225 = arith.muli %224, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1923493Z %226 = arith.addi %49, %225 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1923692Z %227 = tt.broadcast %226 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 
2026-02-21T12:38:38.1923973Z %228 = ttg.convert_layout %227 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1924194Z %229 = arith.addi %228, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1924416Z %230 = tt.addptr %18, %229 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1924637Z %231 = tt.load %230 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1924845Z %232 = arith.truncf %209 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1925082Z %233 = tt.reshape %221 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1925320Z %234 = tt.reshape %232 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.1925562Z %235 = tt.reshape %231 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.1925861Z %236 = ttg.convert_layout %234 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1926240Z %237 = ttg.convert_layout %235 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1926543Z %238 = ttg.convert_layout %233 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1926948Z %239 = tt.dot %236, %237, %238, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1927367Z %240 = tt.reshape %239 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1927545Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T12:38:38.1927677Z %241 = arith.muli %c256_i32, %c2_i32_13 : i32 2026-02-21T12:38:38.1927805Z %242 = arith.addi %arg5, %241 : i32 2026-02-21T12:38:38.1927943Z %243 = tt.splat %242 : i32 -> 
tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1928101Z %244 = arith.addi %243, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1928340Z %245 = ttg.convert_layout %244 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.1928673Z %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.1928964Z %247 = ttg.convert_layout %246 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.1929254Z %248 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.1929598Z %249 = tt.expand_dims %248 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.1929905Z %250 = ttg.convert_layout %249 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1930122Z %251 = arith.muli %250, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1930326Z %252 = tt.broadcast %251 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1930535Z %253 = arith.addi %47, %252 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1930797Z %254 = tt.addptr %16, %253 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1931018Z %255 = tt.load %254 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1931228Z %256 = tt.reshape %255 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.1931525Z %257 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1931898Z %258 = ttg.convert_layout %256 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 
2026-02-21T12:38:38.1932203Z %259 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1932605Z %260 = tt.dot %257, %258, %259, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1932999Z %261 = tt.reshape %260 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1933239Z %262 = arith.truncf %261 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1933481Z %263 = arith.extf %262 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1933673Z %264 = "tt.reduce"(%263) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1933797Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1933945Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.1934070Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1934256Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1934549Z %265 = ttg.convert_layout %264 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1934823Z %266 = arith.truncf %265 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1935049Z %267 = arith.extf %266 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1935267Z %268 = arith.mulf %267, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1935458Z %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1935684Z %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1935881Z %271 = arith.cmpf ogt, %199, %270 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1936054Z %272 = arith.cmpf une, %199, %199 : tensor<1x1xf32, 
#blocked2> 2026-02-21T12:38:38.1936217Z %273 = arith.ori %271, %272 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.1936415Z %274 = arith.select %273, %199, %270 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1936628Z %275 = arith.mulf %263, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1936837Z %276 = arith.truncf %275 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1937132Z %277 = ttg.convert_layout %274 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1937470Z %278 = tt.expand_dims %277 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1937775Z %279 = ttg.convert_layout %278 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1938025Z %280 = arith.extf %276 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1938268Z %281 = tt.broadcast %279 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.1938540Z %282 = ttg.convert_layout %281 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1938755Z %283 = arith.subf %280, %282 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1939066Z %284 = tt.extern_elementwise %283 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1939361Z %285 = "tt.reduce"(%284) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1939506Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1939632Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.1939756Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1939949Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1940243Z %286 = ttg.convert_layout %285 : tensor<1x1xf32, 
#ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1940485Z %287 = arith.subf %199, %274 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1940776Z %288 = tt.extern_elementwise %287 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1941066Z %289 = arith.mulf %215, %288 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1941228Z %290 = arith.addf %289, %286 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1941474Z %291 = ttg.convert_layout %288 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1941825Z %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1942127Z %293 = ttg.convert_layout %292 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1942370Z %294 = tt.broadcast %293 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.1942623Z %295 = ttg.convert_layout %294 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1942844Z %296 = arith.mulf %240, %295 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1943112Z %297 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.1943461Z %298 = tt.expand_dims %297 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.1943771Z %299 = ttg.convert_layout %298 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1943990Z %300 = arith.muli %299, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1944160Z %301 = arith.addi %49, %300 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1944360Z %302 = tt.broadcast %301 : 
tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.1944623Z %303 = ttg.convert_layout %302 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1944844Z %304 = arith.addi %303, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1945068Z %305 = tt.addptr %18, %304 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1945297Z %306 = tt.load %305 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1945509Z %307 = arith.truncf %284 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1945758Z %308 = tt.reshape %296 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1945991Z %309 = tt.reshape %307 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.1946235Z %310 = tt.reshape %306 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.1946559Z %311 = ttg.convert_layout %309 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1946915Z %312 = ttg.convert_layout %310 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1947223Z %313 = ttg.convert_layout %308 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1947650Z %314 = tt.dot %311, %312, %313, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1948051Z %315 = tt.reshape %314 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1948239Z %c3_i32 = arith.constant 3 : i32 2026-02-21T12:38:38.1948366Z %316 = arith.muli %c256_i32, %c3_i32 : i32 2026-02-21T12:38:38.1948497Z %317 = arith.addi %arg5, %316 : i32 
2026-02-21T12:38:38.1948637Z %318 = tt.splat %317 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1948799Z %319 = arith.addi %318, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1949045Z %320 = ttg.convert_layout %319 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.1949379Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.1949680Z %322 = ttg.convert_layout %321 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.1949984Z %323 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.1950332Z %324 = tt.expand_dims %323 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.1950645Z %325 = ttg.convert_layout %324 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1950862Z %326 = arith.muli %325, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1951093Z %327 = tt.broadcast %326 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1951304Z %328 = arith.addi %47, %327 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1951526Z %329 = tt.addptr %16, %328 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1951755Z %330 = tt.load %329 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1951965Z %331 = tt.reshape %330 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.1952270Z %332 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1952630Z %333 = ttg.convert_layout %331 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, 
#ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1952946Z %334 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1953360Z %335 = tt.dot %332, %333, %334, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1953758Z %336 = tt.reshape %335 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1954004Z %337 = arith.truncf %336 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1954252Z %338 = arith.extf %337 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1954481Z %339 = "tt.reduce"(%338) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1954615Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1954738Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.1954868Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1955058Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1955357Z %340 = ttg.convert_layout %339 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1955650Z %341 = arith.truncf %340 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1955874Z %342 = arith.extf %341 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1956075Z %343 = arith.mulf %342, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1956269Z %344 = arith.truncf %343 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1956495Z %345 = arith.extf %344 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1956694Z %346 = arith.cmpf ogt, %274, %345 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1956862Z %347 = 
arith.cmpf une, %274, %274 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1957028Z %348 = arith.ori %346, %347 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.1957221Z %349 = arith.select %348, %274, %345 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1957432Z %350 = arith.mulf %338, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1957656Z %351 = arith.truncf %350 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1957948Z %352 = ttg.convert_layout %349 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1958283Z %353 = tt.expand_dims %352 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1958580Z %354 = ttg.convert_layout %353 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1958828Z %355 = arith.extf %351 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1959089Z %356 = tt.broadcast %354 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.1959338Z %357 = ttg.convert_layout %356 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1959559Z %358 = arith.subf %355, %357 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1959865Z %359 = tt.extern_elementwise %358 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1960161Z %360 = "tt.reduce"(%359) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1960296Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1960417Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.1960543Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.1960730Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1961025Z %361 
= ttg.convert_layout %360 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1961267Z %362 = arith.subf %274, %349 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1961557Z %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1961849Z %364 = arith.mulf %290, %363 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1962007Z %365 = arith.addf %364, %361 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1962251Z %366 = ttg.convert_layout %363 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1962642Z %367 = tt.expand_dims %366 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1962945Z %368 = ttg.convert_layout %367 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1963193Z %369 = tt.broadcast %368 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.1963463Z %370 = ttg.convert_layout %369 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1963681Z %371 = arith.mulf %315, %370 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1963930Z %372 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.1964278Z %373 = tt.expand_dims %372 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.1964588Z %374 = ttg.convert_layout %373 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1964800Z %375 = arith.muli %374, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1964973Z %376 = arith.addi %49, %375 : tensor<1x256x1xi32, #blocked> 
2026-02-21T12:38:38.1965171Z %377 = tt.broadcast %376 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.1965438Z %378 = ttg.convert_layout %377 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1965681Z %379 = arith.addi %378, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1965901Z %380 = tt.addptr %18, %379 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1966130Z %381 = tt.load %380 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1966338Z %382 = arith.truncf %359 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1966584Z %383 = tt.reshape %371 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1966823Z %384 = tt.reshape %382 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.1967083Z %385 = tt.reshape %381 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.1967393Z %386 = ttg.convert_layout %384 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1967753Z %387 = ttg.convert_layout %385 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1968063Z %388 = ttg.convert_layout %383 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1968476Z %389 = tt.dot %386, %387, %388, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1968871Z %390 = tt.reshape %389 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1969150Z scf.yield %349, %365, %390 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, 
#blocked1> 2026-02-21T12:38:38.1969378Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.1969714Z %51:3 = scf.for %arg5 = %c0_i32_8 to %c512_i32 step %c256_i32 iter_args(%arg6 = %50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T12:38:38.1970062Z %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1970221Z %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1970481Z %95 = ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.1970817Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.1971107Z %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.1971397Z %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.1971753Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.1972065Z %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1972286Z %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1972496Z %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1972704Z %103 = arith.addi %47, %102 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1972920Z %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1973146Z %105 = tt.load %104 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1973354Z %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> 
tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.1973681Z %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1974035Z %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1974338Z %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1974746Z %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.1975158Z %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1975396Z %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1975641Z %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1975831Z %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1975955Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1976076Z %166 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.1976200Z tt.reduce.return %166 : f32 2026-02-21T12:38:38.1976390Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1976679Z %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1976951Z %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1977174Z %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1977368Z %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1977563Z %119 = arith.truncf 
%118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.1977782Z %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1977978Z %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1978151Z %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1978333Z %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.1978530Z %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1978735Z %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1978942Z %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1979229Z %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1979587Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1979885Z %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1980129Z %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1980368Z %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.1980615Z %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1980827Z %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1981130Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.1981420Z %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.1981563Z 
^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.1981682Z %166 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.1981802Z tt.reduce.return %166 : f32 2026-02-21T12:38:38.1981989Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.1982278Z %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1982520Z %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1982830Z %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1983123Z %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1983282Z %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.1983519Z %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1983851Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1984147Z %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1984387Z %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.1984633Z %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1984847Z %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1985096Z %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.1985443Z %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 
2026-02-21T12:38:38.1985745Z %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1985958Z %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1986134Z %151 = arith.addi %49, %150 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1986332Z %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.1986586Z %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1986807Z %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1987041Z %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.1987261Z %156 = tt.load %155 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1987468Z %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.1987707Z %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1987940Z %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.1988180Z %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.1988478Z %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.1988836Z %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.1989139Z %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1989559Z %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = 
#blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.1989954Z %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1990225Z scf.yield %124, %140, %165 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1990447Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.1990688Z %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.1991017Z %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.1991309Z %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.1991544Z %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.1991783Z %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1991994Z %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.1992192Z %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1> 2026-02-21T12:38:38.1992438Z %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1992648Z tt.store %59, %58 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1992792Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T12:38:38.1992917Z %60 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T12:38:38.1993037Z %61 = arith.addi %arg4, %60 : i32 2026-02-21T12:38:38.1993157Z %62 = arith.divsi %61, %c8192_i32 : i32 2026-02-21T12:38:38.1993276Z %63 = arith.muli %62, %c16_i32 : i32 2026-02-21T12:38:38.1993394Z %64 = arith.subi %c192_i32, %63 : i32 2026-02-21T12:38:38.1993507Z %65 = arith.minsi %64, %c16_i32 : i32 2026-02-21T12:38:38.1993625Z %66 = arith.remsi %61, 
%c8192_i32 : i32 2026-02-21T12:38:38.1993757Z %67 = arith.remsi %66, %65 : i32 2026-02-21T12:38:38.1993869Z %68 = arith.addi %63, %67 : i32 2026-02-21T12:38:38.1993980Z %69 = arith.divsi %66, %65 : i32 2026-02-21T12:38:38.1994092Z %70 = arith.muli %68, %c65536_i32 : i32 2026-02-21T12:38:38.1994212Z %71 = arith.muli %69, %c128_i32 : i32 2026-02-21T12:38:38.1994324Z %72 = arith.addi %70, %71 : i32 2026-02-21T12:38:38.1994457Z %73 = tt.splat %72 : i32 -> tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1994629Z %74 = arith.addi %73, %10 : tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1994834Z %75 = tt.addptr %11, %74 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.1995045Z %76 = tt.load %75 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.1995201Z %77 = tt.splat %70 : i32 -> tensor<1x128x1xi32, #blocked> 2026-02-21T12:38:38.1995356Z %78 = arith.addi %77, %15 : tensor<1x128x1xi32, #blocked> 2026-02-21T12:38:38.1995550Z %79 = tt.broadcast %78 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x256xi32, #blocked> 2026-02-21T12:38:38.1995797Z %80 = ttg.convert_layout %79 : tensor<1x128x256xi32, #blocked> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1996043Z %81 = tt.reshape %76 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3> 2026-02-21T12:38:38.1996237Z %82 = tt.splat %70 : i32 -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.1996374Z %c0_i32_10 = arith.constant 0 : i32 2026-02-21T12:38:38.1996500Z %c1024_i32_11 = arith.constant 1024 : i32 2026-02-21T12:38:38.1996854Z %83:3 = scf.for %arg5 = %c0_i32 to %c0_i32_10 step %c1024_i32_11 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T12:38:38.1997204Z %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1997362Z %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.1997599Z %95 = 
ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.1997924Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.1998236Z %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.1998522Z %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.1998861Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.1999163Z %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1999376Z %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.1999587Z %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.1999794Z %103 = arith.addi %80, %102 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2000012Z %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2000232Z %105 = tt.load %104 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2000441Z %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2000739Z %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2001093Z %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2001412Z %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2001825Z %110 = tt.dot %107, %108, 
%109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2002222Z %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2002462Z %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2002763Z %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2002954Z %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2003080Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2003199Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2003361Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2003547Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2003838Z %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2004111Z %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2004330Z %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2004526Z %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2004881Z %119 = arith.truncf %118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2005102Z %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2005298Z %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2005468Z %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2005632Z %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2005823Z %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, 
#blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2006030Z %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2006260Z %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2006545Z %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2006879Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2007172Z %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2007416Z %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2007654Z %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2007898Z %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2008114Z %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2008417Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2008709Z %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2008836Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2008953Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2009074Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2009258Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2009563Z %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2009803Z %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2> 
2026-02-21T12:38:38.2010087Z %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2010392Z %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2010550Z %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2010789Z %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2011121Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2011414Z %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2011657Z %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2011900Z %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2012113Z %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2012364Z %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2012727Z %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2013032Z %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2013244Z %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2013406Z %151 = arith.addi %82, %150 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2013606Z %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2013877Z %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> 
tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2014098Z %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2014314Z %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2014538Z %156 = tt.load %155 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2014744Z %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2014984Z %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2015218Z %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2015455Z %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2015755Z %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2016115Z %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2016415Z %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2016820Z %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2017223Z %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2017404Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T12:38:38.2017536Z %166 = arith.muli %c256_i32, %c1_i32_12 : i32 2026-02-21T12:38:38.2017661Z %167 = arith.addi %arg5, %166 : i32 2026-02-21T12:38:38.2017800Z %168 = tt.splat %167 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2017955Z %169 = arith.addi %168, %12 : 
tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2018210Z %170 = ttg.convert_layout %169 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2018543Z %171 = tt.expand_dims %170 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2018833Z %172 = ttg.convert_layout %171 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2019123Z %173 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2019465Z %174 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2019773Z %175 = ttg.convert_layout %174 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2019989Z %176 = arith.muli %175, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2028460Z %177 = tt.broadcast %176 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2028732Z %178 = arith.addi %80, %177 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2028964Z %179 = tt.addptr %16, %178 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2029192Z %180 = tt.load %179 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2029406Z %181 = tt.reshape %180 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2029708Z %182 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2030085Z %183 = ttg.convert_layout %181 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2030393Z %184 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> 
tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2030808Z %185 = tt.dot %182, %183, %184, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2031211Z %186 = tt.reshape %185 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2031453Z %187 = arith.truncf %186 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2031699Z %188 = arith.extf %187 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2031892Z %189 = "tt.reduce"(%188) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2032023Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2032150Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2032277Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2032470Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2032768Z %190 = ttg.convert_layout %189 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2033042Z %191 = arith.truncf %190 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2033262Z %192 = arith.extf %191 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2033471Z %193 = arith.mulf %192, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2033658Z %194 = arith.truncf %193 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2033878Z %195 = arith.extf %194 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2034073Z %196 = arith.cmpf ogt, %124, %195 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2034238Z %197 = arith.cmpf une, %124, %124 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2034420Z %198 = arith.ori %196, %197 : tensor<1x1xi1, #blocked2> 
2026-02-21T12:38:38.2034611Z %199 = arith.select %198, %124, %195 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2034816Z %200 = arith.mulf %188, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2035020Z %201 = arith.truncf %200 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2035311Z %202 = ttg.convert_layout %199 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2035650Z %203 = tt.expand_dims %202 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2035945Z %204 = ttg.convert_layout %203 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2036192Z %205 = arith.extf %201 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2036429Z %206 = tt.broadcast %204 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2036696Z %207 = ttg.convert_layout %206 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2036912Z %208 = arith.subf %205, %207 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2037215Z %209 = tt.extern_elementwise %208 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2037510Z %210 = "tt.reduce"(%209) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2037636Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2037757Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2037895Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2038082Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2038378Z %211 = ttg.convert_layout %210 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 
2026-02-21T12:38:38.2038618Z %212 = arith.subf %124, %199 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2038904Z %213 = tt.extern_elementwise %212 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2039191Z %214 = arith.mulf %140, %213 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2039347Z %215 = arith.addf %214, %211 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2039588Z %216 = ttg.convert_layout %213 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2039920Z %217 = tt.expand_dims %216 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2040217Z %218 = ttg.convert_layout %217 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2040459Z %219 = tt.broadcast %218 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2040703Z %220 = ttg.convert_layout %219 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2040916Z %221 = arith.mulf %165, %220 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2041178Z %222 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2041525Z %223 = tt.expand_dims %222 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2041833Z %224 = ttg.convert_layout %223 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2042043Z %225 = arith.muli %224, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2042225Z %226 = arith.addi %82, %225 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2042425Z %227 = tt.broadcast %226 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 
2026-02-21T12:38:38.2042734Z %228 = ttg.convert_layout %227 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2042955Z %229 = arith.addi %228, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2043172Z %230 = tt.addptr %18, %229 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2043397Z %231 = tt.load %230 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2043602Z %232 = arith.truncf %209 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2043846Z %233 = tt.reshape %221 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2044082Z %234 = tt.reshape %232 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2044320Z %235 = tt.reshape %231 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2044642Z %236 = ttg.convert_layout %234 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2044997Z %237 = ttg.convert_layout %235 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2045299Z %238 = ttg.convert_layout %233 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2045724Z %239 = tt.dot %236, %237, %238, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2046119Z %240 = tt.reshape %239 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2046302Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T12:38:38.2046430Z %241 = arith.muli %c256_i32, %c2_i32_13 : i32 2026-02-21T12:38:38.2046561Z %242 = arith.addi %arg5, %241 : i32 2026-02-21T12:38:38.2046700Z %243 = tt.splat %242 : i32 -> 
tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2046855Z %244 = arith.addi %243, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2047099Z %245 = ttg.convert_layout %244 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2047430Z %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2047727Z %247 = ttg.convert_layout %246 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2048016Z %248 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2048362Z %249 = tt.expand_dims %248 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2048671Z %250 = ttg.convert_layout %249 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2048888Z %251 = arith.muli %250, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2049113Z %252 = tt.broadcast %251 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2049321Z %253 = arith.addi %80, %252 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2049534Z %254 = tt.addptr %16, %253 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2049760Z %255 = tt.load %254 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2049965Z %256 = tt.reshape %255 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2050285Z %257 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2050644Z %258 = ttg.convert_layout %256 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 
2026-02-21T12:38:38.2050949Z %259 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2051359Z %260 = tt.dot %257, %258, %259, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2051753Z %261 = tt.reshape %260 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2051993Z %262 = arith.truncf %261 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2052238Z %263 = arith.extf %262 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2052449Z %264 = "tt.reduce"(%263) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2052578Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2052697Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2052823Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2053011Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2053300Z %265 = ttg.convert_layout %264 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2053572Z %266 = arith.truncf %265 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2053807Z %267 = arith.extf %266 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2054003Z %268 = arith.mulf %267, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2054193Z %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2054412Z %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2054604Z %271 = arith.cmpf ogt, %199, %270 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2054763Z %272 = arith.cmpf une, %199, %199 : tensor<1x1xf32, 
#blocked2> 2026-02-21T12:38:38.2054924Z %273 = arith.ori %271, %272 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2055114Z %274 = arith.select %273, %199, %270 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2055314Z %275 = arith.mulf %263, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2055520Z %276 = arith.truncf %275 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2055802Z %277 = ttg.convert_layout %274 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2056136Z %278 = tt.expand_dims %277 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2056432Z %279 = ttg.convert_layout %278 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2056672Z %280 = arith.extf %276 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2056938Z %281 = tt.broadcast %279 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2057180Z %282 = ttg.convert_layout %281 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2057392Z %283 = arith.subf %280, %282 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2057695Z %284 = tt.extern_elementwise %283 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2058001Z %285 = "tt.reduce"(%284) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2058128Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2058244Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2058363Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2058545Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2058834Z %286 = ttg.convert_layout %285 : tensor<1x1xf32, 
#ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2059073Z %287 = arith.subf %199, %274 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2059355Z %288 = tt.extern_elementwise %287 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2059641Z %289 = arith.mulf %215, %288 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2059795Z %290 = arith.addf %289, %286 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2060048Z %291 = ttg.convert_layout %288 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2060379Z %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2060671Z %293 = ttg.convert_layout %292 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2060912Z %294 = tt.broadcast %293 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2061154Z %295 = ttg.convert_layout %294 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2061384Z %296 = arith.mulf %240, %295 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2061632Z %297 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2061974Z %298 = tt.expand_dims %297 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2062280Z %299 = ttg.convert_layout %298 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2062491Z %300 = arith.muli %299, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2062654Z %301 = arith.addi %82, %300 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2062853Z %302 = tt.broadcast %301 : 
tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2063106Z %303 = ttg.convert_layout %302 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2063327Z %304 = arith.addi %303, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2063542Z %305 = tt.addptr %18, %304 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2063768Z %306 = tt.load %305 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2063975Z %307 = arith.truncf %284 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2064214Z %308 = tt.reshape %296 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2064448Z %309 = tt.reshape %307 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2064703Z %310 = tt.reshape %306 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2065002Z %311 = ttg.convert_layout %309 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2065359Z %312 = ttg.convert_layout %310 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2065678Z %313 = ttg.convert_layout %308 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2066084Z %314 = tt.dot %311, %312, %313, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2066476Z %315 = tt.reshape %314 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2066656Z %c3_i32 = arith.constant 3 : i32 2026-02-21T12:38:38.2066779Z %316 = arith.muli %c256_i32, %c3_i32 : i32 2026-02-21T12:38:38.2066900Z %317 = arith.addi %arg5, %316 : i32 
2026-02-21T12:38:38.2067033Z %318 = tt.splat %317 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2067186Z %319 = arith.addi %318, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2067424Z %320 = ttg.convert_layout %319 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2067776Z %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2068067Z %322 = ttg.convert_layout %321 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2068354Z %323 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2068696Z %324 = tt.expand_dims %323 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2069003Z %325 = ttg.convert_layout %324 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2069234Z %326 = arith.muli %325, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2069442Z %327 = tt.broadcast %326 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2069647Z %328 = arith.addi %80, %327 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2069862Z %329 = tt.addptr %16, %328 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2070082Z %330 = tt.load %329 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2070286Z %331 = tt.reshape %330 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2070582Z %332 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2070935Z %333 = ttg.convert_layout %331 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, 
#ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2071238Z %334 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2071642Z %335 = tt.dot %332, %333, %334, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2072032Z %336 = tt.reshape %335 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2072267Z %337 = arith.truncf %336 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2072526Z %338 = arith.extf %337 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2072712Z %339 = "tt.reduce"(%338) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2072834Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2072953Z %391 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2073073Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2073256Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2073559Z %340 = ttg.convert_layout %339 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2073829Z %341 = arith.truncf %340 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2074052Z %342 = arith.extf %341 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2074245Z %343 = arith.mulf %342, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2074433Z %344 = arith.truncf %343 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2074648Z %345 = arith.extf %344 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2074839Z %346 = arith.cmpf ogt, %274, %345 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2075003Z %347 = 
arith.cmpf une, %274, %274 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2075162Z %348 = arith.ori %346, %347 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2075352Z %349 = arith.select %348, %274, %345 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2075569Z %350 = arith.mulf %338, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2075772Z %351 = arith.truncf %350 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2076054Z %352 = ttg.convert_layout %349 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2076386Z %353 = tt.expand_dims %352 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2076681Z %354 = ttg.convert_layout %353 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2076938Z %355 = arith.extf %351 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2077172Z %356 = tt.broadcast %354 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2077416Z %357 = ttg.convert_layout %356 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2077625Z %358 = arith.subf %355, %357 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2077922Z %359 = tt.extern_elementwise %358 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2078209Z %360 = "tt.reduce"(%359) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2078332Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2078447Z %391 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2078566Z tt.reduce.return %391 : f32 2026-02-21T12:38:38.2078753Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2079040Z %361 
= ttg.convert_layout %360 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2079275Z %362 = arith.subf %274, %349 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2079557Z %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2079839Z %364 = arith.mulf %290, %363 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2080007Z %365 = arith.addf %364, %361 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2080243Z %366 = ttg.convert_layout %363 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2080575Z %367 = tt.expand_dims %366 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2080864Z %368 = ttg.convert_layout %367 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2081120Z %369 = tt.broadcast %368 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2081363Z %370 = ttg.convert_layout %369 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2081573Z %371 = arith.mulf %315, %370 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2081819Z %372 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2082159Z %373 = tt.expand_dims %372 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2082459Z %374 = ttg.convert_layout %373 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2082715Z %375 = arith.muli %374, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2082879Z %376 = arith.addi %82, %375 : tensor<1x256x1xi32, #blocked> 
2026-02-21T12:38:38.2083076Z %377 = tt.broadcast %376 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2083347Z %378 = ttg.convert_layout %377 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2083562Z %379 = arith.addi %378, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2083776Z %380 = tt.addptr %18, %379 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2083995Z %381 = tt.load %380 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2084198Z %382 = arith.truncf %359 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2084462Z %383 = tt.reshape %371 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2084942Z %384 = tt.reshape %382 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2085225Z %385 = tt.reshape %381 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2085529Z %386 = ttg.convert_layout %384 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2085887Z %387 = ttg.convert_layout %385 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2086188Z %388 = ttg.convert_layout %383 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2086593Z %389 = tt.dot %386, %387, %388, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2086990Z %390 = tt.reshape %389 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2087258Z scf.yield %349, %365, %390 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, 
#blocked1> 2026-02-21T12:38:38.2087477Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.2087813Z %84:3 = scf.for %arg5 = %c0_i32_10 to %c512_i32 step %c256_i32 iter_args(%arg6 = %83#0, %arg7 = %83#1, %arg8 = %83#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T12:38:38.2088238Z %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2088394Z %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2088627Z %95 = ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2088956Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2089270Z %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2089555Z %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2089890Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2090190Z %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2090406Z %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2090614Z %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2090820Z %103 = arith.addi %80, %102 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2091039Z %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2091260Z %105 = tt.load %104 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2091485Z %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> 
tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2091784Z %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2092135Z %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2092441Z %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2092870Z %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2093262Z %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2093501Z %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2093741Z %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2093930Z %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2094056Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2094175Z %166 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2094299Z tt.reduce.return %166 : f32 2026-02-21T12:38:38.2094486Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2094776Z %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2095043Z %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2095265Z %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2095458Z %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2095647Z %119 = arith.truncf 
%118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2095866Z %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2096075Z %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2096245Z %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2096406Z %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2096598Z %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2096801Z %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2097021Z %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2097309Z %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2097640Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2097932Z %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2098173Z %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2098407Z %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2098651Z %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2098863Z %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2099181Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2099468Z %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2099591Z 
^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2099708Z %166 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2099826Z tt.reduce.return %166 : f32 2026-02-21T12:38:38.2100010Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2100299Z %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2100554Z %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2100839Z %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2101123Z %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2101281Z %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2101517Z %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2101845Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2102138Z %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2102377Z %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2102620Z %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2102834Z %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2103080Z %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2103420Z %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 
2026-02-21T12:38:38.2103736Z %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2103944Z %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2104104Z %151 = arith.addi %82, %150 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2104299Z %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2104553Z %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2104782Z %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2104998Z %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2105219Z %156 = tt.load %155 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2105422Z %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2105660Z %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2105888Z %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2106127Z %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2106425Z %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2106779Z %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2107092Z %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2107494Z %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = 
#blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2107884Z %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2108152Z scf.yield %124, %140, %165 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2108386Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.2108606Z %85 = ttg.convert_layout %84#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2108932Z %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2109217Z %87 = ttg.convert_layout %86 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2109451Z %88 = tt.broadcast %87 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2109688Z %89 = ttg.convert_layout %88 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2109893Z %90 = arith.divf %84#2, %89 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2110089Z %91 = arith.truncf %90 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1> 2026-02-21T12:38:38.2110333Z %92 = tt.addptr %19, %74 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.2110543Z tt.store %92, %91 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2110721Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.2110888Z scf.for %arg4 = %27 to %3 step %c1_i32 : i32 { 2026-02-21T12:38:38.2111020Z %29 = arith.divsi %arg4, %c8192_i32 : i32 2026-02-21T12:38:38.2111141Z %30 = arith.muli %29, %c16_i32 : i32 2026-02-21T12:38:38.2111255Z %31 = arith.subi %c192_i32, %30 : i32 2026-02-21T12:38:38.2111384Z %32 = arith.minsi %31, %c16_i32 : i32 2026-02-21T12:38:38.2111502Z %33 = arith.remsi %arg4, %c8192_i32 : i32 
2026-02-21T12:38:38.2111618Z %34 = arith.remsi %33, %32 : i32 2026-02-21T12:38:38.2111727Z %35 = arith.addi %30, %34 : i32 2026-02-21T12:38:38.2111833Z %36 = arith.divsi %33, %32 : i32 2026-02-21T12:38:38.2111947Z %37 = arith.muli %35, %c65536_i32 : i32 2026-02-21T12:38:38.2112062Z %38 = arith.muli %36, %c128_i32 : i32 2026-02-21T12:38:38.2112170Z %39 = arith.addi %37, %38 : i32 2026-02-21T12:38:38.2112317Z %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.2112470Z %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.2112671Z %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.2112877Z %43 = tt.load %42 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2113032Z %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked> 2026-02-21T12:38:38.2113185Z %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked> 2026-02-21T12:38:38.2113375Z %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x256xi32, #blocked> 2026-02-21T12:38:38.2113622Z %47 = ttg.convert_layout %46 : tensor<1x128x256xi32, #blocked> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2113865Z %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3> 2026-02-21T12:38:38.2114058Z %49 = tt.splat %37 : i32 -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2114191Z %c0_i32_8 = arith.constant 0 : i32 2026-02-21T12:38:38.2114309Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T12:38:38.2114669Z %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c1024_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T12:38:38.2115014Z %60 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2115165Z %61 = arith.addi %60, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2115394Z %62 = ttg.convert_layout %61 : 
tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2115738Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2116033Z %64 = ttg.convert_layout %63 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2116318Z %65 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2116655Z %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2116953Z %67 = ttg.convert_layout %66 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2117163Z %68 = arith.muli %67, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2117363Z %69 = tt.broadcast %68 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2117565Z %70 = arith.addi %47, %69 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2117780Z %71 = tt.addptr %16, %70 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2117996Z %72 = tt.load %71 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2118200Z %73 = tt.reshape %72 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2118494Z %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2118843Z %75 = ttg.convert_layout %73 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2119164Z %76 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2119574Z %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2119955Z %78 = tt.reshape %77 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2120204Z %79 = arith.truncf %78 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2120441Z %80 = arith.extf %79 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2120627Z %81 = "tt.reduce"(%80) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2120669Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2120722Z %358 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2120763Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2120875Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2121011Z %82 = ttg.convert_layout %81 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2121101Z %83 = arith.truncf %82 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2121188Z %84 = arith.extf %83 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2121250Z %85 = arith.mulf %84, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2121352Z %86 = arith.truncf %85 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2121434Z %87 = arith.extf %86 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2121498Z %88 = arith.cmpf ogt, %arg6, %87 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2121567Z %89 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2121624Z %90 = arith.ori %88, %89 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2121718Z %91 = arith.select %90, %arg6, %87 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2121797Z %92 = 
arith.mulf %80, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2121898Z %93 = arith.truncf %92 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2122038Z %94 = ttg.convert_layout %91 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2122186Z %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2122286Z %96 = ttg.convert_layout %95 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2122382Z %97 = arith.extf %93 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2122478Z %98 = tt.broadcast %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2122633Z %99 = ttg.convert_layout %98 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2122700Z %100 = arith.subf %97, %99 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2122913Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2122963Z %102 = "tt.reduce"(%101) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2123004Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2123049Z %358 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2123090Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2123226Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2123366Z %103 = ttg.convert_layout %102 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2123428Z %104 = arith.subf %arg6, %91 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2123616Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol 
= "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2123698Z %106 = arith.mulf %arg7, %105 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2123760Z %107 = arith.addf %106, %103 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2123901Z %108 = ttg.convert_layout %105 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2124051Z %109 = tt.expand_dims %108 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2124157Z %110 = ttg.convert_layout %109 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2124253Z %111 = tt.broadcast %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2124362Z %112 = ttg.convert_layout %111 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2124428Z %113 = arith.mulf %arg8, %112 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2124578Z %114 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2124749Z %115 = tt.expand_dims %114 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2124858Z %116 = ttg.convert_layout %115 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2124923Z %117 = arith.muli %116, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2124985Z %118 = arith.addi %49, %117 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2125085Z %119 = tt.broadcast %118 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2125220Z %120 = ttg.convert_layout %119 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2125286Z %121 = arith.addi %120, %17 : tensor<1x256x128xi32, 
#blocked1> 2026-02-21T12:38:38.2125404Z %122 = tt.addptr %18, %121 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2125471Z %123 = tt.load %122 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2125573Z %124 = arith.truncf %101 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2125671Z %125 = tt.reshape %113 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2125772Z %126 = tt.reshape %124 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2125874Z %127 = tt.reshape %123 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2126037Z %128 = ttg.convert_layout %126 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2126199Z %129 = ttg.convert_layout %127 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2126303Z %130 = ttg.convert_layout %125 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2126600Z %131 = tt.dot %128, %129, %130, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2126711Z %132 = tt.reshape %131 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2126755Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T12:38:38.2126808Z %133 = arith.muli %c256_i32, %c1_i32_9 : i32 2026-02-21T12:38:38.2126851Z %134 = arith.addi %arg5, %133 : i32 2026-02-21T12:38:38.2126912Z %135 = tt.splat %134 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2126974Z %136 = arith.addi %135, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2127132Z %137 = ttg.convert_layout %136 : tensor<256xi32, #blocked4> -> 
tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2127282Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2127388Z %139 = ttg.convert_layout %138 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2127534Z %140 = ttg.convert_layout %139 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2127690Z %141 = tt.expand_dims %140 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2127804Z %142 = ttg.convert_layout %141 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2127872Z %143 = arith.muli %142, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2127976Z %144 = tt.broadcast %143 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2128059Z %145 = arith.addi %47, %144 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2128174Z %146 = tt.addptr %16, %145 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2128239Z %147 = tt.load %146 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2128345Z %148 = tt.reshape %147 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2128498Z %149 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2128673Z %150 = ttg.convert_layout %148 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2128782Z %151 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2129042Z %152 = tt.dot %149, %150, %151, inputPrecision = tf32 : tensor<1x128xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2129138Z %153 = tt.reshape %152 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2129244Z %154 = arith.truncf %153 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2129343Z %155 = arith.extf %154 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2129393Z %156 = "tt.reduce"(%155) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2129437Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2129486Z %358 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2129530Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2129642Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2129787Z %157 = ttg.convert_layout %156 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2129881Z %158 = arith.truncf %157 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2129971Z %159 = arith.extf %158 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2130051Z %160 = arith.mulf %159, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2130141Z %161 = arith.truncf %160 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2130228Z %162 = arith.extf %161 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2130293Z %163 = arith.cmpf ogt, %91, %162 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2130381Z %164 = arith.cmpf une, %91, %91 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2130442Z %165 = arith.ori %163, %164 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2130541Z %166 = arith.select %165, %91, %162 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 
2026-02-21T12:38:38.2130607Z %167 = arith.mulf %155, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2130709Z %168 = arith.truncf %167 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2130856Z %169 = ttg.convert_layout %166 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2131006Z %170 = tt.expand_dims %169 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2131112Z %171 = ttg.convert_layout %170 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2131214Z %172 = arith.extf %168 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2131326Z %173 = tt.broadcast %171 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2131434Z %174 = ttg.convert_layout %173 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2131502Z %175 = arith.subf %172, %174 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2131702Z %176 = tt.extern_elementwise %175 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2131751Z %177 = "tt.reduce"(%176) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2131791Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2131859Z %358 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2131900Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2132011Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2132154Z %178 = ttg.convert_layout %177 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2132214Z %179 = arith.subf %91, %166 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2132402Z %180 = tt.extern_elementwise 
%179 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2132468Z %181 = arith.mulf %107, %180 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2132527Z %182 = arith.addf %181, %178 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2132666Z %183 = ttg.convert_layout %180 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2132822Z %184 = tt.expand_dims %183 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2132928Z %185 = ttg.convert_layout %184 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2133025Z %186 = tt.broadcast %185 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2133137Z %187 = ttg.convert_layout %186 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2133201Z %188 = arith.mulf %132, %187 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2133361Z %189 = ttg.convert_layout %139 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2133522Z %190 = tt.expand_dims %189 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2133630Z %191 = ttg.convert_layout %190 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2133711Z %192 = arith.muli %191, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2133776Z %193 = arith.addi %49, %192 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2133877Z %194 = tt.broadcast %193 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2133991Z %195 = ttg.convert_layout %194 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2134058Z 
%196 = arith.addi %195, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2134174Z %197 = tt.addptr %18, %196 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2134240Z %198 = tt.load %197 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2134344Z %199 = arith.truncf %176 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2134438Z %200 = tt.reshape %188 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2134537Z %201 = tt.reshape %199 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2134653Z %202 = tt.reshape %198 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2134808Z %203 = ttg.convert_layout %201 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2134970Z %204 = ttg.convert_layout %202 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2135075Z %205 = ttg.convert_layout %200 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2135344Z %206 = tt.dot %203, %204, %205, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2135442Z %207 = tt.reshape %206 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2135489Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T12:38:38.2135540Z %208 = arith.muli %c256_i32, %c2_i32_10 : i32 2026-02-21T12:38:38.2135581Z %209 = arith.addi %arg5, %208 : i32 2026-02-21T12:38:38.2135641Z %210 = tt.splat %209 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2135702Z %211 = arith.addi %210, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2135845Z %212 = 
ttg.convert_layout %211 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2135997Z %213 = tt.expand_dims %212 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2136104Z %214 = ttg.convert_layout %213 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2136253Z %215 = ttg.convert_layout %214 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2136412Z %216 = tt.expand_dims %215 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2136523Z %217 = ttg.convert_layout %216 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2136590Z %218 = arith.muli %217, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2136710Z %219 = tt.broadcast %218 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2136777Z %220 = arith.addi %47, %219 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2136893Z %221 = tt.addptr %16, %220 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2136958Z %222 = tt.load %221 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2137080Z %223 = tt.reshape %222 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2137235Z %224 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2137394Z %225 = ttg.convert_layout %223 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2137505Z %226 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2137761Z %227 = tt.dot 
%224, %225, %226, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2137859Z %228 = tt.reshape %227 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2137963Z %229 = arith.truncf %228 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2138062Z %230 = arith.extf %229 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2138125Z %231 = "tt.reduce"(%230) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2138170Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2138218Z %358 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2138258Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2138373Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2138511Z %232 = ttg.convert_layout %231 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2138627Z %233 = arith.truncf %232 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2138719Z %234 = arith.extf %233 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2138782Z %235 = arith.mulf %234, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2138872Z %236 = arith.truncf %235 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2138960Z %237 = arith.extf %236 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2139026Z %238 = arith.cmpf ogt, %166, %237 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2139091Z %239 = arith.cmpf une, %166, %166 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2139149Z %240 = arith.ori %238, %239 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2139248Z %241 = arith.select %240, %166, %237 : 
tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2139315Z %242 = arith.mulf %230, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2139417Z %243 = arith.truncf %242 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2139563Z %244 = ttg.convert_layout %241 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2139715Z %245 = tt.expand_dims %244 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2139820Z %246 = ttg.convert_layout %245 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2139935Z %247 = arith.extf %243 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2140033Z %248 = tt.broadcast %246 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2140140Z %249 = ttg.convert_layout %248 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2140210Z %250 = arith.subf %247, %249 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2140412Z %251 = tt.extern_elementwise %250 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2140476Z %252 = "tt.reduce"(%251) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2140519Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2140565Z %358 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2140607Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2140722Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2140857Z %253 = ttg.convert_layout %252 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2140919Z %254 = arith.subf %166, %241 : tensor<1x1xf32, #blocked2> 
2026-02-21T12:38:38.2141110Z %255 = tt.extern_elementwise %254 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2141172Z %256 = arith.mulf %182, %255 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2141231Z %257 = arith.addf %256, %253 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2141385Z %258 = ttg.convert_layout %255 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2141536Z %259 = tt.expand_dims %258 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2141641Z %260 = ttg.convert_layout %259 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2141741Z %261 = tt.broadcast %260 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2141862Z %262 = ttg.convert_layout %261 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2141927Z %263 = arith.mulf %207, %262 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2142075Z %264 = ttg.convert_layout %214 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2142232Z %265 = tt.expand_dims %264 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2142338Z %266 = ttg.convert_layout %265 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2142405Z %267 = arith.muli %266, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2142469Z %268 = arith.addi %49, %267 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2142569Z %269 = tt.broadcast %268 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2142684Z %270 = ttg.convert_layout %269 : tensor<1x256x128xi32, #blocked> -> 
tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2142752Z %271 = arith.addi %270, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2142869Z %272 = tt.addptr %18, %271 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2142936Z %273 = tt.load %272 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2143041Z %274 = arith.truncf %251 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2143136Z %275 = tt.reshape %263 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2143254Z %276 = tt.reshape %274 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2143358Z %277 = tt.reshape %273 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2143514Z %278 = ttg.convert_layout %276 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2143673Z %279 = ttg.convert_layout %277 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2143792Z %280 = ttg.convert_layout %275 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2144055Z %281 = tt.dot %278, %279, %280, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2144151Z %282 = tt.reshape %281 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2144200Z %c3_i32 = arith.constant 3 : i32 2026-02-21T12:38:38.2144248Z %283 = arith.muli %c256_i32, %c3_i32 : i32 2026-02-21T12:38:38.2144290Z %284 = arith.addi %arg5, %283 : i32 2026-02-21T12:38:38.2144353Z %285 = tt.splat %284 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2144410Z %286 = arith.addi %285, %12 : tensor<256xi32, 
#blocked4> 2026-02-21T12:38:38.2144553Z %287 = ttg.convert_layout %286 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2144719Z %288 = tt.expand_dims %287 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2144820Z %289 = ttg.convert_layout %288 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2144964Z %290 = ttg.convert_layout %289 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2145123Z %291 = tt.expand_dims %290 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2145249Z %292 = ttg.convert_layout %291 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2145316Z %293 = arith.muli %292, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2145422Z %294 = tt.broadcast %293 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2145486Z %295 = arith.addi %47, %294 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2145602Z %296 = tt.addptr %16, %295 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2145669Z %297 = tt.load %296 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2145771Z %298 = tt.reshape %297 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2145925Z %299 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2146087Z %300 = ttg.convert_layout %298 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2146192Z %301 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 
2026-02-21T12:38:38.2146452Z %302 = tt.dot %299, %300, %301, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2146551Z %303 = tt.reshape %302 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2146652Z %304 = arith.truncf %303 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2146765Z %305 = arith.extf %304 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2146817Z %306 = "tt.reduce"(%305) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2146858Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2146907Z %358 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2146948Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2147080Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2147220Z %307 = ttg.convert_layout %306 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2147311Z %308 = arith.truncf %307 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2147402Z %309 = arith.extf %308 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2147466Z %310 = arith.mulf %309, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2147554Z %311 = arith.truncf %310 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2147643Z %312 = arith.extf %311 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2147709Z %313 = arith.cmpf ogt, %241, %312 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2147773Z %314 = arith.cmpf une, %241, %241 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2147836Z %315 = arith.ori %313, %314 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2147944Z %316 = 
arith.select %315, %241, %312 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2148010Z %317 = arith.mulf %305, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2148114Z %318 = arith.truncf %317 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2148257Z %319 = ttg.convert_layout %316 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2148407Z %320 = tt.expand_dims %319 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2148530Z %321 = ttg.convert_layout %320 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2148629Z %322 = arith.extf %318 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2148728Z %323 = tt.broadcast %321 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2148840Z %324 = ttg.convert_layout %323 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2148904Z %325 = arith.subf %322, %324 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2149107Z %326 = tt.extern_elementwise %325 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2149160Z %327 = "tt.reduce"(%326) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2149201Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2149250Z %358 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2149291Z tt.reduce.return %358 : f32 2026-02-21T12:38:38.2149407Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2149544Z %328 = ttg.convert_layout %327 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2149606Z %329 = arith.subf %241, %316 : 
tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2149795Z %330 = tt.extern_elementwise %329 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2149871Z %331 = arith.mulf %257, %330 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2149930Z %332 = arith.addf %331, %328 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2150071Z %333 = ttg.convert_layout %330 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2150222Z %334 = tt.expand_dims %333 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2150345Z %335 = ttg.convert_layout %334 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2150447Z %336 = tt.broadcast %335 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2150554Z %337 = ttg.convert_layout %336 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2150618Z %338 = arith.mulf %282, %337 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2150768Z %339 = ttg.convert_layout %289 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2150924Z %340 = tt.expand_dims %339 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2151035Z %341 = ttg.convert_layout %340 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2151100Z %342 = arith.muli %341, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2151165Z %343 = arith.addi %49, %342 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2151277Z %344 = tt.broadcast %343 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2151394Z %345 = ttg.convert_layout %344 : 
tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2151457Z %346 = arith.addi %345, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2151574Z %347 = tt.addptr %18, %346 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2151643Z %348 = tt.load %347 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2151745Z %349 = arith.truncf %326 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2151854Z %350 = tt.reshape %338 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2151954Z %351 = tt.reshape %349 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2152056Z %352 = tt.reshape %348 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2152212Z %353 = ttg.convert_layout %351 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2152376Z %354 = ttg.convert_layout %352 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2152479Z %355 = ttg.convert_layout %350 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2152739Z %356 = tt.dot %353, %354, %355, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2152838Z %357 = tt.reshape %356 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2152972Z scf.yield %316, %332, %357 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2153021Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.2153272Z %51:3 = scf.for %arg5 = %c0_i32_8 to %c512_i32 step %c256_i32 iter_args(%arg6 = 
%50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T12:38:38.2153349Z %60 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2153409Z %61 = arith.addi %60, %12 : tensor<256xi32, #blocked4> 2026-02-21T12:38:38.2153551Z %62 = ttg.convert_layout %61 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T12:38:38.2153701Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T12:38:38.2153816Z %64 = ttg.convert_layout %63 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T12:38:38.2153964Z %65 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T12:38:38.2154118Z %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6> 2026-02-21T12:38:38.2154225Z %67 = ttg.convert_layout %66 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2154291Z %68 = arith.muli %67, %cst_3 : tensor<1x1x256xi32, #blocked1> 2026-02-21T12:38:38.2154391Z %69 = tt.broadcast %68 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2154454Z %70 = arith.addi %47, %69 : tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2154568Z %71 = tt.addptr %16, %70 : tensor<1x128x256x!tt.ptr, #blocked1>, tensor<1x128x256xi32, #blocked1> 2026-02-21T12:38:38.2154634Z %72 = tt.load %71 : tensor<1x128x256x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2154748Z %73 = tt.reshape %72 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3> 2026-02-21T12:38:38.2154901Z %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 
2026-02-21T12:38:38.2155057Z %75 = ttg.convert_layout %73 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2155161Z %76 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2155439Z %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3> 2026-02-21T12:38:38.2155533Z %78 = tt.reshape %77 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2155635Z %79 = arith.truncf %78 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2155732Z %80 = arith.extf %79 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2155782Z %81 = "tt.reduce"(%80) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2155824Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2155870Z %133 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:38:38.2155915Z tt.reduce.return %133 : f32 2026-02-21T12:38:38.2156026Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2156163Z %82 = ttg.convert_layout %81 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2156253Z %83 = arith.truncf %82 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2156338Z %84 = arith.extf %83 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2156400Z %85 = arith.mulf %84, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2156488Z %86 = arith.truncf %85 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T12:38:38.2156570Z %87 = arith.extf %86 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2156650Z %88 = 
arith.cmpf ogt, %arg6, %87 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2156719Z %89 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2156776Z %90 = arith.ori %88, %89 : tensor<1x1xi1, #blocked2> 2026-02-21T12:38:38.2156872Z %91 = arith.select %90, %arg6, %87 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2156937Z %92 = arith.mulf %80, %cst_0 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2157054Z %93 = arith.truncf %92 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2157193Z %94 = ttg.convert_layout %91 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2157342Z %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2157444Z %96 = ttg.convert_layout %95 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2157540Z %97 = arith.extf %93 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2157635Z %98 = tt.broadcast %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9> 2026-02-21T12:38:38.2157740Z %99 = ttg.convert_layout %98 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2157803Z %100 = arith.subf %97, %99 : tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2158024Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1> 2026-02-21T12:38:38.2158074Z %102 = "tt.reduce"(%101) <{axis = 2 : i32}> ({ 2026-02-21T12:38:38.2158114Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:38:38.2158159Z %133 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:38:38.2158202Z tt.reduce.return %133 : f32 2026-02-21T12:38:38.2158313Z }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, 
#ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:38:38.2158451Z %103 = ttg.convert_layout %102 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2158530Z %104 = arith.subf %arg6, %91 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2158716Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2158779Z %106 = arith.mulf %arg7, %105 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2158842Z %107 = arith.addf %106, %103 : tensor<1x1xf32, #blocked2> 2026-02-21T12:38:38.2158983Z %108 = ttg.convert_layout %105 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2159134Z %109 = tt.expand_dims %108 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2159242Z %110 = ttg.convert_layout %109 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2159340Z %111 = tt.broadcast %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2159447Z %112 = ttg.convert_layout %111 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2159515Z %113 = arith.mulf %arg8, %112 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2159662Z %114 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T12:38:38.2159819Z %115 = tt.expand_dims %114 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7> 2026-02-21T12:38:38.2159927Z %116 = ttg.convert_layout %115 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2160006Z %117 = arith.muli %116, %cst : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2160067Z 
%118 = arith.addi %49, %117 : tensor<1x256x1xi32, #blocked> 2026-02-21T12:38:38.2160168Z %119 = tt.broadcast %118 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked> 2026-02-21T12:38:38.2160283Z %120 = ttg.convert_layout %119 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2160362Z %121 = arith.addi %120, %17 : tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2160479Z %122 = tt.addptr %18, %121 : tensor<1x256x128x!tt.ptr, #blocked1>, tensor<1x256x128xi32, #blocked1> 2026-02-21T12:38:38.2160545Z %123 = tt.load %122 : tensor<1x256x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2160647Z %124 = arith.truncf %101 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1> 2026-02-21T12:38:38.2160743Z %125 = tt.reshape %113 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2160842Z %126 = tt.reshape %124 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3> 2026-02-21T12:38:38.2160942Z %127 = tt.reshape %123 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3> 2026-02-21T12:38:38.2161100Z %128 = ttg.convert_layout %126 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T12:38:38.2161261Z %129 = ttg.convert_layout %127 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T12:38:38.2161376Z %130 = ttg.convert_layout %125 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2161641Z %131 = tt.dot %128, %129, %130, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T12:38:38.2161736Z %132 = tt.reshape %131 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2161866Z scf.yield %91, %107, %132 : tensor<1x1xf32, 
#blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2161930Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.2162069Z %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> 2026-02-21T12:38:38.2162216Z %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8> 2026-02-21T12:38:38.2162319Z %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T12:38:38.2162413Z %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9> 2026-02-21T12:38:38.2162517Z %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2162627Z %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1> 2026-02-21T12:38:38.2162730Z %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1> 2026-02-21T12:38:38.2162838Z %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T12:38:38.2162903Z tt.store %59, %58 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T12:38:38.2162988Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T12:38:38.2163024Z tt.return 2026-02-21T12:38:38.2163056Z } 2026-02-21T12:38:38.2163094Z } 2026-02-21T12:38:38.2163098Z 2026-02-21T12:38:38.2163128Z {-# 2026-02-21T12:38:38.2163171Z external_resources: { 2026-02-21T12:38:38.2163209Z mlir_reproducer: { 2026-02-21T12:38:38.2165378Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T12:38:38.2165463Z disable_threading: false, 2026-02-21T12:38:38.2165501Z verify_each: true 2026-02-21T12:38:38.2165532Z } 2026-02-21T12:38:38.2165564Z } 2026-02-21T12:38:38.2165594Z #-} 2026-02-21T12:38:38.2165838Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T12:38:38.2166304Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T12:38:38.2166418Z [49s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T12:38:38.2167071Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T12:38:38.2167130Z Error: RuntimeError: PassManager::run failed 2026-02-21T12:38:38.2167212Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T12:38:42.6690209Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.0 configs/s 2026-02-21T12:38:42.6701947Z [53s] Adaptive compile timeout: 30s (90% percentile=13.8s, bounds=[30.0s, 30s]) 2026-02-21T12:38:42.9868856Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2375.4 configs/s 2026-02-21T12:38:43.9050030Z [54s] Initial random population of 100, 5 starting points: 2026-02-21T12:38:43.9050516Z error=19 2026-02-21T12:38:43.9050730Z timeout=6 2026-02-21T12:38:43.9050936Z ok=75 2026-02-21T12:38:43.9051133Z min=0.1508 2026-02-21T12:38:43.9051341Z mid=1.3094 2026-02-21T12:38:43.9051562Z max=127.9501 2026-02-21T12:38:43.9051812Z best={'block_sizes': [1, 128, 64], 2026-02-21T12:38:43.9052239Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T12:38:43.9052652Z 'l2_groupings': [32], 2026-02-21T12:38:43.9052933Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:38:43.9053266Z 'loop_orders': [[1, 0]], 2026-02-21T12:38:43.9053549Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:38:43.9053826Z 'num_sm_multiplier': 64, 2026-02-21T12:38:43.9054090Z 'num_stages': 1, 2026-02-21T12:38:43.9054321Z 'num_warps': 4, 2026-02-21T12:38:43.9054589Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:38:43.9055291Z 'range_flattens': [False, 
False], 2026-02-21T12:38:43.9055597Z 'range_multi_buffers': [False, None], 2026-02-21T12:38:43.9055902Z 'range_num_stages': [2, 3], 2026-02-21T12:38:43.9056183Z 'range_unroll_factors': [1, 2], 2026-02-21T12:38:43.9056480Z 'range_warp_specializes': [], 2026-02-21T12:38:43.9056766Z 'waves_per_eu': 1} 2026-02-21T12:38:43.9140104Z [54s] Fitting surrogate: 100 points, 100 targets 2026-02-21T12:38:44.8828352Z [55s] Generation 1 starting: 91 neighbors, 5 active search path(s) 2026-02-21T12:39:22.3606870Z [93s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:39:24.2049156Z [95s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:39:24.6005320Z [95s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1) 
2026-02-21T12:39:28.9420534Z [99s] Timeout after 30s compiling Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:39:29.6657575Z [100s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:39:30.1054722Z [101s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:39:30.1074250Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 0.8 configs/s 2026-02-21T12:39:35.5005488Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 17.4 configs/s 2026-02-21T12:39:44.1356773Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 110.4 2026-02-21T12:39:44.1357377Z configs/s 2026-02-21T12:39:45.2722918Z [116s] Generation 1 complete: 2026-02-21T12:39:45.2723548Z error=8 2026-02-21T12:39:45.2723761Z 
timeout=6 2026-02-21T12:39:45.2723958Z ok=82 2026-02-21T12:39:45.2724162Z min=0.1517 2026-02-21T12:39:45.2724363Z mid=0.2301 2026-02-21T12:39:45.2724561Z max=1.3124 2026-02-21T12:39:45.2724786Z best={'block_sizes': [1, 128, 64], 2026-02-21T12:39:45.2725199Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T12:39:45.2725627Z 'l2_groupings': [32], 2026-02-21T12:39:45.2725910Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:39:45.2726227Z 'loop_orders': [[1, 0]], 2026-02-21T12:39:45.2726681Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:39:45.2726966Z 'num_sm_multiplier': 64, 2026-02-21T12:39:45.2727228Z 'num_stages': 1, 2026-02-21T12:39:45.2727460Z 'num_warps': 4, 2026-02-21T12:39:45.2727738Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:39:45.2728065Z 'range_flattens': [False, True], 2026-02-21T12:39:45.2728368Z 'range_multi_buffers': [False, None], 2026-02-21T12:39:45.2728675Z 'range_num_stages': [2, 3], 2026-02-21T12:39:45.2728961Z 'range_unroll_factors': [1, 2], 2026-02-21T12:39:45.2729263Z 'range_warp_specializes': [], 2026-02-21T12:39:45.2729543Z 'waves_per_eu': 1} 2026-02-21T12:39:45.2792145Z [116s] Fitting surrogate: 196 points, 196 targets 2026-02-21T12:39:46.3264766Z [117s] Generation 2 starting: 95 neighbors, 5 active search path(s) 2026-02-21T12:40:24.6721056Z [155s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:40:24.6743800Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 1.1 configs/s 2026-02-21T12:40:30.3219828Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 
97/97 17.3 configs/s 2026-02-21T12:40:39.0259002Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 110.1 2026-02-21T12:40:39.0259603Z configs/s 2026-02-21T12:40:40.0867599Z [171s] Generation 2 complete: 2026-02-21T12:40:40.0867939Z error=14 2026-02-21T12:40:40.0868145Z timeout=1 2026-02-21T12:40:40.0868742Z ok=85 2026-02-21T12:40:40.0868947Z min=0.1490 2026-02-21T12:40:40.0869148Z mid=0.2267 2026-02-21T12:40:40.0869369Z max=2.2459 2026-02-21T12:40:40.0869594Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:40:40.0870129Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T12:40:40.0870543Z 'l2_groupings': [64], 2026-02-21T12:40:40.0870838Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:40:40.0871152Z 'loop_orders': [[1, 0]], 2026-02-21T12:40:40.0871427Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:40:40.0871705Z 'num_sm_multiplier': 2, 2026-02-21T12:40:40.0871977Z 'num_stages': 1, 2026-02-21T12:40:40.0872203Z 'num_warps': 8, 2026-02-21T12:40:40.0872464Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:40:40.0872782Z 'range_flattens': [None, None], 2026-02-21T12:40:40.0883787Z 'range_multi_buffers': [False, None], 2026-02-21T12:40:40.0884024Z 'range_num_stages': [2, 1], 2026-02-21T12:40:40.0884249Z 'range_unroll_factors': [1, 2], 2026-02-21T12:40:40.0884463Z 'range_warp_specializes': [], 2026-02-21T12:40:40.0884652Z 'waves_per_eu': 1} 2026-02-21T12:40:40.1733027Z [171s] Fitting surrogate: 296 points, 296 targets 2026-02-21T12:40:41.1042296Z [172s] Generation 3 starting: 86 neighbors, 5 active search path(s) 2026-02-21T12:41:22.7241823Z [213s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 
0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:41:23.2620304Z [214s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:41:23.6825907Z [214s] Timeout after 30s compiling Config(block_sizes=[1, 256, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:41:24.5426196Z [215s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:41:24.5444460Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.5 configs/s 2026-02-21T12:41:29.8797364Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 16.4 configs/s 2026-02-21T12:41:40.6658364Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 89.7 configs/s 2026-02-21T12:41:41.7912071Z [232s] Generation 3 complete: 2026-02-21T12:41:41.7912422Z error=3 2026-02-21T12:41:41.7912616Z timeout=4 
2026-02-21T12:41:41.7912830Z ok=84 2026-02-21T12:41:41.7913005Z min=0.1467 2026-02-21T12:41:41.7913191Z mid=0.1644 2026-02-21T12:41:41.7913364Z max=2.3764 2026-02-21T12:41:41.7913570Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:41:41.7913953Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T12:41:41.7914317Z 'l2_groupings': [64], 2026-02-21T12:41:41.7914914Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:41:41.7915194Z 'loop_orders': [[0, 1]], 2026-02-21T12:41:41.7915445Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:41:41.7915710Z 'num_sm_multiplier': 2, 2026-02-21T12:41:41.7915946Z 'num_stages': 1, 2026-02-21T12:41:41.7916149Z 'num_warps': 8, 2026-02-21T12:41:41.7916402Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:41:41.7916684Z 'range_flattens': [None, None], 2026-02-21T12:41:41.7916956Z 'range_multi_buffers': [False, None], 2026-02-21T12:41:41.7917237Z 'range_num_stages': [2, 1], 2026-02-21T12:41:41.7917482Z 'range_unroll_factors': [1, 2], 2026-02-21T12:41:41.7917759Z 'range_warp_specializes': [], 2026-02-21T12:41:41.7918002Z 'waves_per_eu': 1} 2026-02-21T12:41:41.7957907Z [232s] Fitting surrogate: 387 points, 387 targets 2026-02-21T12:41:42.7834164Z [233s] Generation 4 starting: 98 neighbors, 5 active search path(s) 2026-02-21T12:42:26.6159516Z [277s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:42:28.4166451Z [279s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', 
''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:42:28.4184512Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T12:42:33.6803332Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 19.2 configs/s 2026-02-21T12:42:43.8177052Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 95.0 configs/s 2026-02-21T12:42:44.8879401Z [295s] Generation 4 complete: 2026-02-21T12:42:44.8883681Z error=11 2026-02-21T12:42:44.8884014Z timeout=2 2026-02-21T12:42:44.8884662Z ok=90 2026-02-21T12:42:44.8884869Z min=0.1470 2026-02-21T12:42:44.8885073Z mid=0.2032 2026-02-21T12:42:44.8885281Z max=1.3009 2026-02-21T12:42:44.8885515Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:42:44.8885990Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:42:44.8886431Z 'l2_groupings': [64], 2026-02-21T12:42:44.8886709Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:42:44.8887025Z 'loop_orders': [[0, 1]], 2026-02-21T12:42:44.8887305Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:42:44.8887586Z 'num_sm_multiplier': 2, 2026-02-21T12:42:44.8887848Z 'num_stages': 1, 2026-02-21T12:42:44.8888086Z 'num_warps': 8, 2026-02-21T12:42:44.8888518Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:42:44.8888847Z 'range_flattens': [None, None], 2026-02-21T12:42:44.8889166Z 'range_multi_buffers': [False, None], 2026-02-21T12:42:44.8889472Z 'range_num_stages': [2, 1], 2026-02-21T12:42:44.8889744Z 'range_unroll_factors': [1, 3], 2026-02-21T12:42:44.8890048Z 'range_warp_specializes': [], 2026-02-21T12:42:44.8890327Z 'waves_per_eu': 1} 2026-02-21T12:42:44.9781102Z [296s] Fitting surrogate: 490 points, 490 targets 2026-02-21T12:42:45.8363495Z [296s] Generation 5 
starting: 86 neighbors, 5 active search path(s) 2026-02-21T12:43:29.2250827Z [340s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:43:29.2270296Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.5 configs/s 2026-02-21T12:43:33.9641635Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 18.6 configs/s 2026-02-21T12:43:43.2643534Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 103.4 2026-02-21T12:43:43.2644150Z configs/s 2026-02-21T12:43:44.2806748Z [355s] Generation 5 complete: 2026-02-21T12:43:44.2807081Z error=9 2026-02-21T12:43:44.2807311Z timeout=1 2026-02-21T12:43:44.2807513Z ok=81 2026-02-21T12:43:44.2807707Z min=0.1429 2026-02-21T12:43:44.2807913Z mid=0.2024 2026-02-21T12:43:44.2808106Z max=4.2551 2026-02-21T12:43:44.2808331Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:43:44.2808738Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:43:44.2809141Z 'l2_groupings': [64], 2026-02-21T12:43:44.2809432Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:43:44.2809745Z 'loop_orders': [[0, 1]], 2026-02-21T12:43:44.2810344Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:43:44.2810621Z 'num_sm_multiplier': 2, 2026-02-21T12:43:44.2810883Z 'num_stages': 1, 2026-02-21T12:43:44.2811106Z 'num_warps': 8, 2026-02-21T12:43:44.2811374Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:43:44.2811696Z 'range_flattens': [None, None], 2026-02-21T12:43:44.2812001Z 'range_multi_buffers': [False, None], 2026-02-21T12:43:44.2812301Z 'range_num_stages': [2, 1], 
2026-02-21T12:43:44.2812573Z 'range_unroll_factors': [1, 4], 2026-02-21T12:43:44.2812995Z 'range_warp_specializes': [], 2026-02-21T12:43:44.2813263Z 'waves_per_eu': 1} 2026-02-21T12:43:44.3703993Z [355s] Fitting surrogate: 581 points, 581 targets 2026-02-21T12:43:45.1527518Z [356s] Generation 6 starting: 76 neighbors, 4 active search path(s) 2026-02-21T12:44:24.9290715Z [395s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:44:26.2162136Z [397s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:44:27.2520029Z [398s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:44:27.6504535Z [398s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 
'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:44:27.6522247Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 0.7 configs/s 2026-02-21T12:44:31.7167070Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 19.2 configs/s 2026-02-21T12:44:40.0736343Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 114.3 2026-02-21T12:44:40.0736927Z configs/s 2026-02-21T12:44:40.9457990Z [411s] Generation 6 complete: 2026-02-21T12:44:40.9458327Z error=6 2026-02-21T12:44:40.9458535Z timeout=4 2026-02-21T12:44:40.9458736Z ok=70 2026-02-21T12:44:40.9458936Z min=0.1436 2026-02-21T12:44:40.9459139Z mid=0.1978 2026-02-21T12:44:40.9459331Z max=0.9076 2026-02-21T12:44:40.9459831Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:44:40.9460243Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:44:40.9460654Z 'l2_groupings': [64], 2026-02-21T12:44:40.9460929Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:44:40.9461237Z 'loop_orders': [[0, 1]], 2026-02-21T12:44:40.9461517Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:44:40.9461808Z 'num_sm_multiplier': 2, 2026-02-21T12:44:40.9462068Z 'num_stages': 1, 2026-02-21T12:44:40.9462296Z 'num_warps': 8, 2026-02-21T12:44:40.9462694Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:44:40.9463013Z 'range_flattens': [None, None], 2026-02-21T12:44:40.9463318Z 'range_multi_buffers': [False, False], 2026-02-21T12:44:40.9463632Z 'range_num_stages': [2, 1], 2026-02-21T12:44:40.9463907Z 'range_unroll_factors': [1, 4], 2026-02-21T12:44:40.9464199Z 'range_warp_specializes': [], 2026-02-21T12:44:40.9464471Z 'waves_per_eu': 1} 
2026-02-21T12:44:41.0217393Z [412s] Fitting surrogate: 661 points, 661 targets 2026-02-21T12:44:41.7752660Z [412s] Generation 7 starting: 74 neighbors, 4 active search path(s) 2026-02-21T12:45:03.6642983Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 2.0 configs/s 2026-02-21T12:45:08.1020976Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.5 configs/s 2026-02-21T12:45:15.5915705Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 127.3 2026-02-21T12:45:15.5916321Z configs/s 2026-02-21T12:45:16.6211399Z [447s] Generation 7 complete: 2026-02-21T12:45:16.6211926Z error=4 2026-02-21T12:45:16.6212196Z ok=74 2026-02-21T12:45:16.6212402Z min=0.1442 2026-02-21T12:45:16.6212609Z mid=0.2146 2026-02-21T12:45:16.6220517Z max=1.6789 2026-02-21T12:45:16.6220770Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:45:16.6221195Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:45:16.6221608Z 'l2_groupings': [64], 2026-02-21T12:45:16.6221899Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:45:16.6222223Z 'loop_orders': [[0, 1]], 2026-02-21T12:45:16.6222510Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:45:16.6222797Z 'num_sm_multiplier': 2, 2026-02-21T12:45:16.6223059Z 'num_stages': 1, 2026-02-21T12:45:16.6223333Z 'num_warps': 8, 2026-02-21T12:45:16.6223734Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:45:16.6224381Z 'range_flattens': [None, None], 2026-02-21T12:45:16.6224690Z 'range_multi_buffers': [False, False], 2026-02-21T12:45:16.6225013Z 'range_num_stages': [2, 2], 2026-02-21T12:45:16.6225243Z 'range_unroll_factors': [2, 4], 2026-02-21T12:45:16.6225455Z 'range_warp_specializes': [], 2026-02-21T12:45:16.6225634Z 'waves_per_eu': 1} 2026-02-21T12:45:16.6996316Z [447s] Fitting surrogate: 739 points, 739 targets 2026-02-21T12:45:17.4442030Z [448s] Generation 8 starting: 64 neighbors, 4 active search path(s) 2026-02-21T12:45:49.6503140Z [480s] Timeout after 30s compiling 
Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[2, 2], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:45:49.6522989Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 0.6 configs/s 2026-02-21T12:45:53.2928170Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.8 configs/s 2026-02-21T12:46:00.8844125Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 125.3 2026-02-21T12:46:00.8844402Z configs/s 2026-02-21T12:46:01.8411702Z [492s] Generation 8 complete: 2026-02-21T12:46:01.8412134Z error=5 2026-02-21T12:46:01.8412345Z timeout=1 2026-02-21T12:46:01.8412552Z ok=62 2026-02-21T12:46:01.8413170Z min=0.1406 2026-02-21T12:46:01.8413374Z mid=0.2013 2026-02-21T12:46:01.8413565Z max=0.8192 2026-02-21T12:46:01.8413797Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:46:01.8414220Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:46:01.8414629Z 'l2_groupings': [32], 2026-02-21T12:46:01.8414905Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:46:01.8415230Z 'loop_orders': [[0, 1]], 2026-02-21T12:46:01.8415512Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:46:01.8415792Z 'num_sm_multiplier': 2, 2026-02-21T12:46:01.8416195Z 'num_stages': 1, 2026-02-21T12:46:01.8416423Z 'num_warps': 8, 2026-02-21T12:46:01.8416682Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:46:01.8417013Z 'range_flattens': [None, False], 2026-02-21T12:46:01.8417329Z 'range_multi_buffers': [False, False], 2026-02-21T12:46:01.8417641Z 'range_num_stages': [2, 2], 2026-02-21T12:46:01.8417918Z 'range_unroll_factors': [2, 4], 2026-02-21T12:46:01.8418216Z 'range_warp_specializes': [], 
2026-02-21T12:46:01.8418487Z 'waves_per_eu': 1} 2026-02-21T12:46:01.9192679Z [492s] Fitting surrogate: 807 points, 807 targets 2026-02-21T12:46:02.7283201Z [493s] Generation 9 starting: 71 neighbors, 4 active search path(s) 2026-02-21T12:46:37.3900567Z [528s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:46:45.5357596Z [536s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:46:45.5374692Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.6 configs/s 2026-02-21T12:46:49.3533156Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 19.1 configs/s 2026-02-21T12:46:58.6533828Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 112.1 2026-02-21T12:46:58.6536331Z configs/s 2026-02-21T12:46:59.5990971Z [550s] Generation 9 complete: 2026-02-21T12:46:59.5991258Z error=8 2026-02-21T12:46:59.5991464Z timeout=2 2026-02-21T12:46:59.5991632Z ok=65 2026-02-21T12:46:59.5991802Z min=0.1416 2026-02-21T12:46:59.5991971Z mid=0.1618 2026-02-21T12:46:59.5992142Z max=0.7893 2026-02-21T12:46:59.5992334Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:46:59.5992708Z 'indexing': ['block_ptr', 
'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:46:59.5993020Z 'l2_groupings': [32], 2026-02-21T12:46:59.5993234Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:46:59.5993474Z 'loop_orders': [[0, 1]], 2026-02-21T12:46:59.5993687Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:46:59.5993903Z 'num_sm_multiplier': 2, 2026-02-21T12:46:59.5994101Z 'num_stages': 1, 2026-02-21T12:46:59.5994286Z 'num_warps': 8, 2026-02-21T12:46:59.5994487Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:46:59.5994746Z 'range_flattens': [None, None], 2026-02-21T12:46:59.5994984Z 'range_multi_buffers': [False, False], 2026-02-21T12:46:59.5995225Z 'range_num_stages': [2, 2], 2026-02-21T12:46:59.5995438Z 'range_unroll_factors': [2, 4], 2026-02-21T12:46:59.5995667Z 'range_warp_specializes': [], 2026-02-21T12:46:59.5995882Z 'waves_per_eu': 1} 2026-02-21T12:46:59.6964899Z [550s] Fitting surrogate: 882 points, 882 targets 2026-02-21T12:47:00.4726296Z [551s] Generation 10 starting: 70 neighbors, 4 active search path(s) 2026-02-21T12:47:37.4376095Z [588s] Timeout after 30s compiling Config(block_sizes=[1, 256, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:47:37.4399627Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.6 configs/s 2026-02-21T12:47:41.3466522Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 72/72 18.7 configs/s 2026-02-21T12:47:47.2743758Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 158.5 2026-02-21T12:47:47.2744368Z configs/s 2026-02-21T12:47:48.1848693Z [599s] Generation 10 complete: 2026-02-21T12:47:48.1849101Z error=6 2026-02-21T12:47:48.1849310Z 
timeout=1 2026-02-21T12:47:48.1849523Z ok=67 2026-02-21T12:47:48.1849719Z min=0.1417 2026-02-21T12:47:48.1849925Z mid=0.2268 2026-02-21T12:47:48.1850123Z max=0.7611 2026-02-21T12:47:48.1850357Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:47:48.1850770Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:47:48.1851191Z 'l2_groupings': [32], 2026-02-21T12:47:48.1851467Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:47:48.1851808Z 'loop_orders': [[0, 1]], 2026-02-21T12:47:48.1852098Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:47:48.1852375Z 'num_sm_multiplier': 2, 2026-02-21T12:47:48.1852637Z 'num_stages': 1, 2026-02-21T12:47:48.1853195Z 'num_warps': 8, 2026-02-21T12:47:48.1853457Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:47:48.1853786Z 'range_flattens': [None, None], 2026-02-21T12:47:48.1854096Z 'range_multi_buffers': [False, False], 2026-02-21T12:47:48.1854414Z 'range_num_stages': [2, 2], 2026-02-21T12:47:48.1854702Z 'range_unroll_factors': [2, 4], 2026-02-21T12:47:48.1855002Z 'range_warp_specializes': [], 2026-02-21T12:47:48.1855282Z 'waves_per_eu': 1} 2026-02-21T12:47:48.2493831Z [599s] Fitting surrogate: 956 points, 956 targets 2026-02-21T12:47:49.0435453Z [600s] Generation 11 starting: 69 neighbors, 4 active search path(s) 2026-02-21T12:48:31.4965245Z [642s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:48:32.5723525Z [643s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', 
'', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:48:32.5745752Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.5 configs/s 2026-02-21T12:48:36.5031805Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 71/71 18.3 configs/s 2026-02-21T12:48:40.4541863Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 231.5 2026-02-21T12:48:40.4542468Z configs/s 2026-02-21T12:48:41.3096504Z [652s] Generation 11 complete: 2026-02-21T12:48:41.3096860Z error=4 2026-02-21T12:48:41.3097072Z timeout=2 2026-02-21T12:48:41.3097285Z ok=67 2026-02-21T12:48:41.3101629Z min=0.1220 2026-02-21T12:48:41.3101891Z mid=0.2179 2026-02-21T12:48:41.3102416Z max=1.0024 2026-02-21T12:48:41.3102636Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:48:41.3103044Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:48:41.3103424Z 'l2_groupings': [1], 2026-02-21T12:48:41.3103681Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:48:41.3103982Z 'loop_orders': [[0, 1]], 2026-02-21T12:48:41.3104252Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:48:41.3104509Z 'num_stages': 2, 2026-02-21T12:48:41.3104723Z 'num_warps': 4, 2026-02-21T12:48:41.3105084Z 'pid_type': 'flat', 2026-02-21T12:48:41.3105333Z 'range_flattens': [None, True], 2026-02-21T12:48:41.3105615Z 'range_multi_buffers': [None, True], 2026-02-21T12:48:41.3105907Z 'range_num_stages': [0, 2], 2026-02-21T12:48:41.3106265Z 'range_unroll_factors': [0, 2], 2026-02-21T12:48:41.3106548Z 'range_warp_specializes': [], 2026-02-21T12:48:41.3106806Z 'waves_per_eu': 2} 2026-02-21T12:48:41.3458323Z [652s] Fitting surrogate: 1029 points, 1029 targets 2026-02-21T12:48:42.1287853Z [653s] Generation 12 starting: 72 neighbors, 4 active search 
path(s) 2026-02-21T12:49:17.7777528Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.5 configs/s 2026-02-21T12:49:22.8615812Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 74/74 15.0 configs/s 2026-02-21T12:49:29.2054866Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 149.5 2026-02-21T12:49:29.2058025Z configs/s 2026-02-21T12:49:30.1396402Z [701s] Generation 12 complete: 2026-02-21T12:49:30.1396767Z error=4 2026-02-21T12:49:30.1396986Z ok=72 2026-02-21T12:49:30.1397193Z min=0.1208 2026-02-21T12:49:30.1397418Z mid=0.1917 2026-02-21T12:49:30.1397638Z max=1.3322 2026-02-21T12:49:30.1397879Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:49:30.1398297Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:49:30.1398709Z 'l2_groupings': [1], 2026-02-21T12:49:30.1398986Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:49:30.1399324Z 'loop_orders': [[0, 1]], 2026-02-21T12:49:30.1399616Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:49:30.1399883Z 'num_stages': 1, 2026-02-21T12:49:30.1400122Z 'num_warps': 4, 2026-02-21T12:49:30.1400351Z 'pid_type': 'flat', 2026-02-21T12:49:30.1400619Z 'range_flattens': [None, True], 2026-02-21T12:49:30.1400922Z 'range_multi_buffers': [None, True], 2026-02-21T12:49:30.1401512Z 'range_num_stages': [0, 2], 2026-02-21T12:49:30.1401786Z 'range_unroll_factors': [0, 2], 2026-02-21T12:49:30.1402104Z 'range_warp_specializes': [], 2026-02-21T12:49:30.1402391Z 'waves_per_eu': 2} 2026-02-21T12:49:30.2115531Z [701s] Fitting surrogate: 1105 points, 1105 targets 2026-02-21T12:49:30.8939213Z [701s] Generation 13 starting: 55 neighbors, 3 active search path(s) 2026-02-21T12:50:08.3530171Z [739s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, 
pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:50:08.5218313Z [739s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:50:08.5230771Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 0.7 configs/s 2026-02-21T12:50:11.7251947Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 57/57 18.1 configs/s 2026-02-21T12:50:13.7228268Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 421.9 2026-02-21T12:50:13.7228861Z configs/s 2026-02-21T12:50:14.4189620Z [745s] Generation 13 complete: 2026-02-21T12:50:14.4190015Z error=2 2026-02-21T12:50:14.4190222Z timeout=2 2026-02-21T12:50:14.4190430Z ok=54 2026-02-21T12:50:14.4190663Z min=0.1215 2026-02-21T12:50:14.4190867Z mid=0.2396 2026-02-21T12:50:14.4191067Z max=1.9598 2026-02-21T12:50:14.4191307Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:50:14.4191754Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:50:14.4192156Z 'l2_groupings': [1], 2026-02-21T12:50:14.4192904Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:50:14.4193222Z 'loop_orders': [[0, 1]], 2026-02-21T12:50:14.4193511Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:50:14.4193777Z 'num_stages': 1, 2026-02-21T12:50:14.4194015Z 'num_warps': 4, 2026-02-21T12:50:14.4194255Z 'pid_type': 'flat', 2026-02-21T12:50:14.4194530Z 'range_flattens': [None, True], 2026-02-21T12:50:14.4194846Z 'range_multi_buffers': [None, True], 
2026-02-21T12:50:14.4195153Z 'range_num_stages': [0, 2], 2026-02-21T12:50:14.4195437Z 'range_unroll_factors': [0, 2], 2026-02-21T12:50:14.4195762Z 'range_warp_specializes': [], 2026-02-21T12:50:14.4196036Z 'waves_per_eu': 2} 2026-02-21T12:50:14.4406838Z [745s] Fitting surrogate: 1163 points, 1163 targets 2026-02-21T12:50:14.9072852Z [745s] Generation 14 starting: 40 neighbors, 2 active search path(s) 2026-02-21T12:50:46.7037234Z [777s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:50:47.2413727Z [778s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:50:47.9290768Z [778s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:50:47.9311716Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.0 
configs/s 2026-02-21T12:50:50.3222830Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 17.9 configs/s 2026-02-21T12:50:51.9759350Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 721.7 2026-02-21T12:50:51.9759992Z configs/s 2026-02-21T12:50:52.6562386Z [783s] Generation 14 complete: 2026-02-21T12:50:52.6562774Z timeout=3 2026-02-21T12:50:52.6563273Z ok=40 2026-02-21T12:50:52.6563433Z min=0.1216 2026-02-21T12:50:52.6563594Z mid=0.2526 2026-02-21T12:50:52.6563750Z max=0.8945 2026-02-21T12:50:52.6563926Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:50:52.6564275Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:50:52.6564595Z 'l2_groupings': [1], 2026-02-21T12:50:52.6564814Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:50:52.6565067Z 'loop_orders': [[0, 1]], 2026-02-21T12:50:52.6565383Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:50:52.6565596Z 'num_stages': 1, 2026-02-21T12:50:52.6565777Z 'num_warps': 4, 2026-02-21T12:50:52.6565963Z 'pid_type': 'flat', 2026-02-21T12:50:52.6566167Z 'range_flattens': [None, True], 2026-02-21T12:50:52.6566409Z 'range_multi_buffers': [None, True], 2026-02-21T12:50:52.6566656Z 'range_num_stages': [0, 2], 2026-02-21T12:50:52.6566878Z 'range_unroll_factors': [0, 2], 2026-02-21T12:50:52.6567115Z 'range_warp_specializes': [], 2026-02-21T12:50:52.6567330Z 'waves_per_eu': 2} 2026-02-21T12:50:52.6688883Z [783s] Fitting surrogate: 1206 points, 1206 targets 2026-02-21T12:50:53.1456907Z [784s] Generation 15 starting: 40 neighbors, 2 active search path(s) 2026-02-21T12:51:14.3064360Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.3 configs/s 2026-02-21T12:51:16.8585867Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 17.5 configs/s 2026-02-21T12:51:17.3089521Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1620.0 2026-02-21T12:51:17.3089984Z configs/s 2026-02-21T12:51:18.0854997Z [809s] Generation 15 complete: 
2026-02-21T12:51:18.0855323Z error=1 2026-02-21T12:51:18.0855519Z ok=42 2026-02-21T12:51:18.0855714Z min=0.1229 2026-02-21T12:51:18.0858811Z mid=0.2730 2026-02-21T12:51:18.0859040Z max=0.7985 2026-02-21T12:51:18.0859567Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:51:18.0859973Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:51:18.0860377Z 'l2_groupings': [1], 2026-02-21T12:51:18.0860634Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:51:18.0860928Z 'loop_orders': [[0, 1]], 2026-02-21T12:51:18.0861172Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:51:18.0861403Z 'num_stages': 1, 2026-02-21T12:51:18.0861599Z 'num_warps': 4, 2026-02-21T12:51:18.0861804Z 'pid_type': 'flat', 2026-02-21T12:51:18.0862022Z 'range_flattens': [None, True], 2026-02-21T12:51:18.0862281Z 'range_multi_buffers': [None, True], 2026-02-21T12:51:18.0862550Z 'range_num_stages': [0, 2], 2026-02-21T12:51:18.0862784Z 'range_unroll_factors': [0, 2], 2026-02-21T12:51:18.0863038Z 'range_warp_specializes': [], 2026-02-21T12:51:18.0863275Z 'waves_per_eu': 2} 2026-02-21T12:51:18.0947075Z [809s] Fitting surrogate: 1249 points, 1249 targets 2026-02-21T12:51:18.5385906Z [809s] Generation 16 starting: 38 neighbors, 2 active search path(s) 2026-02-21T12:51:49.8083945Z [840s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:51:49.8099557Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 0.8 configs/s 2026-02-21T12:51:52.1218403Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 40/40 17.7 configs/s 2026-02-21T12:51:53.4597128Z 
Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 599.8 2026-02-21T12:51:53.4597707Z configs/s 2026-02-21T12:51:54.2497847Z [845s] Generation 16 complete: 2026-02-21T12:51:54.2498196Z error=1 2026-02-21T12:51:54.2498391Z timeout=1 2026-02-21T12:51:54.2498583Z ok=39 2026-02-21T12:51:54.2499057Z min=0.1248 2026-02-21T12:51:54.2499254Z mid=0.2399 2026-02-21T12:51:54.2499441Z max=1.2635 2026-02-21T12:51:54.2499659Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:51:54.2500070Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:51:54.2500458Z 'l2_groupings': [1], 2026-02-21T12:51:54.2500723Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:51:54.2501023Z 'loop_orders': [[0, 1]], 2026-02-21T12:51:54.2501291Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:51:54.2501556Z 'num_stages': 1, 2026-02-21T12:51:54.2501776Z 'num_warps': 4, 2026-02-21T12:51:54.2501999Z 'pid_type': 'flat', 2026-02-21T12:51:54.2502250Z 'range_flattens': [None, True], 2026-02-21T12:51:54.2502538Z 'range_multi_buffers': [None, True], 2026-02-21T12:51:54.2502827Z 'range_num_stages': [0, 2], 2026-02-21T12:51:54.2503089Z 'range_unroll_factors': [0, 2], 2026-02-21T12:51:54.2503375Z 'range_warp_specializes': [], 2026-02-21T12:51:54.2503645Z 'waves_per_eu': 2} 2026-02-21T12:51:54.2660867Z [845s] Fitting surrogate: 1290 points, 1290 targets 2026-02-21T12:51:54.7517903Z [845s] Generation 17 starting: 41 neighbors, 2 active search path(s) 2026-02-21T12:52:31.7872054Z [882s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, True], range_num_stages=[3, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:52:31.7892224Z 
Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 0.4 configs/s 2026-02-21T12:52:34.2000394Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 17.8 configs/s 2026-02-21T12:52:36.4599116Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 376.0 2026-02-21T12:52:36.4599676Z configs/s 2026-02-21T12:52:37.1623938Z [888s] Generation 17 complete: 2026-02-21T12:52:37.1624327Z error=1 2026-02-21T12:52:37.1624533Z timeout=1 2026-02-21T12:52:37.1624761Z ok=42 2026-02-21T12:52:37.1624960Z min=0.1222 2026-02-21T12:52:37.1625163Z mid=0.2331 2026-02-21T12:52:37.1625362Z max=0.8132 2026-02-21T12:52:37.1625593Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:52:37.1626020Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:52:37.1626437Z 'l2_groupings': [1], 2026-02-21T12:52:37.1626720Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:52:37.1627036Z 'loop_orders': [[0, 1]], 2026-02-21T12:52:37.1627318Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:52:37.1627587Z 'num_stages': 1, 2026-02-21T12:52:37.1627819Z 'num_warps': 4, 2026-02-21T12:52:37.1628049Z 'pid_type': 'flat', 2026-02-21T12:52:37.1628324Z 'range_flattens': [None, True], 2026-02-21T12:52:37.1628624Z 'range_multi_buffers': [None, True], 2026-02-21T12:52:37.1628939Z 'range_num_stages': [0, 2], 2026-02-21T12:52:37.1629212Z 'range_unroll_factors': [0, 2], 2026-02-21T12:52:37.1629507Z 'range_warp_specializes': [], 2026-02-21T12:52:37.1629792Z 'waves_per_eu': 2} 2026-02-21T12:52:37.1829173Z [888s] Fitting surrogate: 1334 points, 1334 targets 2026-02-21T12:52:37.4716303Z [888s] Generation 18 starting: 19 neighbors, 1 active search path(s) 2026-02-21T12:53:08.9500802Z [919s] Timeout after 30s compiling Config(block_sizes=[1, 64, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=2, num_warps=1, 
pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[3, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:53:08.9521175Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.6 configs/s 2026-02-21T12:53:10.0999497Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 18.2 configs/s 2026-02-21T12:53:10.4202332Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2173.6 2026-02-21T12:53:10.4205731Z configs/s 2026-02-21T12:53:11.1090635Z [922s] Generation 18 complete: 2026-02-21T12:53:11.1090897Z timeout=1 2026-02-21T12:53:11.1091008Z ok=20 2026-02-21T12:53:11.1091146Z min=0.1229 2026-02-21T12:53:11.1100325Z mid=0.2640 2026-02-21T12:53:11.1100459Z max=0.8246 2026-02-21T12:53:11.1100576Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:53:11.1100785Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:53:11.1100984Z 'l2_groupings': [1], 2026-02-21T12:53:11.1101124Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:53:11.1101287Z 'loop_orders': [[0, 1]], 2026-02-21T12:53:11.1101426Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:53:11.1101563Z 'num_stages': 1, 2026-02-21T12:53:11.1101697Z 'num_warps': 4, 2026-02-21T12:53:11.1101815Z 'pid_type': 'flat', 2026-02-21T12:53:11.1101943Z 'range_flattens': [None, True], 2026-02-21T12:53:11.1103469Z 'range_multi_buffers': [None, True], 2026-02-21T12:53:11.1104220Z 'range_num_stages': [0, 2], 2026-02-21T12:53:11.1104532Z 'range_unroll_factors': [0, 2], 2026-02-21T12:53:11.1104843Z 'range_warp_specializes': [], 2026-02-21T12:53:11.1105131Z 'waves_per_eu': 2} 2026-02-21T12:53:11.1187601Z [922s] Fitting surrogate: 1355 points, 1355 targets 2026-02-21T12:53:11.3960542Z [922s] Generation 19 starting: 17 neighbors, 1 active search path(s) 2026-02-21T12:53:16.5759503Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 5.2 configs/s 
2026-02-21T12:53:17.7064907Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.3 configs/s 2026-02-21T12:53:18.5123080Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 984.9 2026-02-21T12:53:18.5123684Z configs/s 2026-02-21T12:53:19.2856281Z [930s] Generation 19 complete: 2026-02-21T12:53:19.2856527Z ok=19 2026-02-21T12:53:19.2856685Z min=0.1231 2026-02-21T12:53:19.2856843Z mid=0.1896 2026-02-21T12:53:19.2857008Z max=0.4380 2026-02-21T12:53:19.2857176Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:53:19.2857502Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:53:19.2857803Z 'l2_groupings': [1], 2026-02-21T12:53:19.2858023Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:53:19.2858264Z 'loop_orders': [[0, 1]], 2026-02-21T12:53:19.2858472Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:53:19.2858674Z 'num_stages': 1, 2026-02-21T12:53:19.2858844Z 'num_warps': 4, 2026-02-21T12:53:19.2859016Z 'pid_type': 'flat', 2026-02-21T12:53:19.2859208Z 'range_flattens': [None, True], 2026-02-21T12:53:19.2859442Z 'range_multi_buffers': [None, True], 2026-02-21T12:53:19.2859671Z 'range_num_stages': [0, 2], 2026-02-21T12:53:19.2859876Z 'range_unroll_factors': [0, 2], 2026-02-21T12:53:19.2860103Z 'range_warp_specializes': [], 2026-02-21T12:53:19.2860311Z 'waves_per_eu': 2} 2026-02-21T12:53:19.3012727Z [930s] Fitting surrogate: 1374 points, 1374 targets 2026-02-21T12:53:19.6408002Z [930s] Generation 20 starting: 19 neighbors, 1 active search path(s) 2026-02-21T12:53:29.2106305Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.0 configs/s 2026-02-21T12:53:30.4759224Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.0 configs/s 2026-02-21T12:53:32.5983923Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 408.1 2026-02-21T12:53:32.5984433Z configs/s 2026-02-21T12:53:33.2845962Z [944s] Generation 20 complete: 2026-02-21T12:53:33.2846207Z ok=21 
2026-02-21T12:53:33.2846362Z min=0.1248 2026-02-21T12:53:33.2846537Z mid=0.1549 2026-02-21T12:53:33.2846685Z max=0.4356 2026-02-21T12:53:33.2846857Z best={'block_sizes': [1, 128, 32], 2026-02-21T12:53:33.2847353Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T12:53:33.2847648Z 'l2_groupings': [1], 2026-02-21T12:53:33.2847850Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:53:33.2848204Z 'loop_orders': [[0, 1]], 2026-02-21T12:53:33.2848412Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:53:33.2848607Z 'num_stages': 1, 2026-02-21T12:53:33.2848776Z 'num_warps': 4, 2026-02-21T12:53:33.2848948Z 'pid_type': 'flat', 2026-02-21T12:53:33.2849148Z 'range_flattens': [None, True], 2026-02-21T12:53:33.2849368Z 'range_multi_buffers': [None, True], 2026-02-21T12:53:33.2849595Z 'range_num_stages': [0, 2], 2026-02-21T12:53:33.2849800Z 'range_unroll_factors': [0, 2], 2026-02-21T12:53:33.2850028Z 'range_warp_specializes': [], 2026-02-21T12:53:33.2850236Z 'waves_per_eu': 2} 2026-02-21T12:53:33.3105083Z [944s] Fitting surrogate: 1395 points, 1395 targets 2026-02-21T12:53:33.4652619Z [944s] Autotuning complete in 944.5s after searching 1311 configs. 
2026-02-21T12:53:33.4653271Z One can hardcode the best config and skip autotuning with: 2026-02-21T12:53:33.4655211Z @helion.kernel(config=helion.Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T12:53:33.4656946Z 2026-02-21T12:53:33.4657406Z [944s] Code of selected kernel: /tmp/torchinductor_root/rz/crztikscnyuaoookxi6imyujxfonfnzor3eprliz4nchtvjwxhku.py 2026-02-21T12:53:33.4896669Z from __future__ import annotations 2026-02-21T12:53:33.4896968Z 2026-02-21T12:53:33.4897073Z import torch 2026-02-21T12:53:33.4897497Z import triton 2026-02-21T12:53:33.4897753Z import triton.language as tl 2026-02-21T12:53:33.4898108Z from torch._inductor.runtime import triton_helpers 2026-02-21T12:53:33.4898582Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T12:53:33.4899084Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T12:53:33.4899417Z 2026-02-21T12:53:33.4899536Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T12:53:33.4899842Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T12:53:33.4900144Z _SHAPE_DIM = tl.constexpr(128) 2026-02-21T12:53:33.4900439Z _BLOCK_SIZE_3 = tl.constexpr(32) 2026-02-21T12:53:33.4900725Z _SHAPE_DIM_1 = tl.constexpr(128) 2026-02-21T12:53:33.4900914Z 2026-02-21T12:53:33.4901007Z @triton.jit 2026-02-21T12:53:33.4901383Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr): 2026-02-21T12:53:33.4902001Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T12:53:33.4902451Z num_blocks_0 = 192 2026-02-21T12:53:33.4902739Z pid_0 = tl.program_id(0) % num_blocks_0 
2026-02-21T12:53:33.4903080Z pid_1 = tl.program_id(0) // num_blocks_0 2026-02-21T12:53:33.4903392Z offset_0 = pid_0 2026-02-21T12:53:33.4903689Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T12:53:33.4904020Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T12:53:33.4904342Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T12:53:33.4904628Z indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32) 2026-02-21T12:53:33.4904971Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T12:53:33.4905419Z m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32) 2026-02-21T12:53:33.4905730Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T12:53:33.4906017Z l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32) 2026-02-21T12:53:33.4906361Z # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32) 2026-02-21T12:53:33.4906722Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32) 2026-02-21T12:53:33.4914198Z # src[attention.py:71]: q = q_view[tile_b, tile_m, :] 2026-02-21T12:53:33.4914669Z q = tl.load(tl.make_block_ptr(q_view, [192, 512, 128], [65536, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T12:53:33.4915106Z # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)): 2026-02-21T12:53:33.4915323Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T12:53:33.4915520Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T12:53:33.4915687Z # src[attention.py:72-85]: ... 
2026-02-21T12:53:33.4915995Z for offset_2 in tl.range(0, 512, _BLOCK_SIZE_3, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=False, flatten=True): 2026-02-21T12:53:33.4916355Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32) 2026-02-21T12:53:33.4916538Z q_copy = q 2026-02-21T12:53:33.4916652Z m_i_copy = m_i 2026-02-21T12:53:33.4916770Z l_i_copy = l_i 2026-02-21T12:53:33.4916886Z acc_copy = acc 2026-02-21T12:53:33.4917002Z q_copy_0 = q_copy 2026-02-21T12:53:33.4917132Z m_i_copy_0 = m_i_copy 2026-02-21T12:53:33.4917265Z l_i_copy_0 = l_i_copy 2026-02-21T12:53:33.4917393Z acc_copy_0 = acc_copy 2026-02-21T12:53:33.4917547Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T12:53:33.4917974Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 512], [65536, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T12:53:33.4918387Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T12:53:33.4918959Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T12:53:33.4919554Z # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale) 2026-02-21T12:53:33.4919790Z amax = tl.cast(tl.max(qk, 2), tl.bfloat16) 2026-02-21T12:53:33.4919945Z v_0 = 0.12751743074602467 2026-02-21T12:53:33.4920090Z v_1 = tl.cast(amax * v_0, tl.bfloat16) 2026-02-21T12:53:33.4920243Z v_2 = tl.cast(v_1, tl.float32) 2026-02-21T12:53:33.4920408Z v_3 = triton_helpers.maximum(m_i_copy_0, v_2) 2026-02-21T12:53:33.4920609Z # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None] 2026-02-21T12:53:33.4920789Z v_4 = 0.12751743074602467 2026-02-21T12:53:33.4920929Z v_5 = tl.cast(qk * v_4, tl.bfloat16) 2026-02-21T12:53:33.4921080Z subscript 
= v_3[:, :, None] 2026-02-21T12:53:33.4921224Z v_6 = tl.cast(v_5, tl.float32) 2026-02-21T12:53:33.4921367Z v_7 = v_6 - subscript 2026-02-21T12:53:33.4921529Z # src[attention.py:77]: p = torch.exp2(qk) 2026-02-21T12:53:33.4921691Z v_8 = libdevice.exp2(v_7) 2026-02-21T12:53:33.4921848Z # src[attention.py:78]: l_ij = torch.sum(p, -1) 2026-02-21T12:53:33.4922028Z l_ij = tl.cast(tl.sum(v_8, 2), tl.float32) 2026-02-21T12:53:33.4922214Z # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij) 2026-02-21T12:53:33.4922392Z v_9 = m_i_copy_0 - v_3 2026-02-21T12:53:33.4922527Z v_10 = libdevice.exp2(v_9) 2026-02-21T12:53:33.4922764Z # src[attention.py:80]: l_i = l_i * alpha + l_ij 2026-02-21T12:53:33.4922952Z v_11 = l_i_copy_0 * v_10 2026-02-21T12:53:33.4923086Z l_i = v_11 + l_ij 2026-02-21T12:53:33.4923238Z # src[attention.py:81]: acc = acc * alpha[:, :, None] 2026-02-21T12:53:33.4923411Z subscript_1 = v_10[:, :, None] 2026-02-21T12:53:33.4923557Z v_13 = acc_copy_0 * subscript_1 2026-02-21T12:53:33.4923729Z # src[attention.py:82]: v = v_view[tile_b, tile_n, :] 2026-02-21T12:53:33.4924043Z v = tl.load(v_view + (indices_0[:, None, None] * 65536 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T12:53:33.4924365Z # src[attention.py:83]: p = p.to(v.dtype) 2026-02-21T12:53:33.4924554Z v_14 = tl.cast(v_8, tl.bfloat16) 2026-02-21T12:53:33.4924701Z # src[attention.py:84]: acc = torch.baddbmm(acc, p, v) 2026-02-21T12:53:33.4925164Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T12:53:33.4925605Z # src[attention.py:85]: m_i = m_ij 2026-02-21T12:53:33.4925716Z m_i = v_3 2026-02-21T12:53:33.4925829Z # src[attention.py:87]: acc = acc / l_i[:, :, None] 2026-02-21T12:53:33.4925965Z subscript_2 = l_i[:, :, None] 
2026-02-21T12:53:33.4926076Z v_15 = acc / subscript_2 2026-02-21T12:53:33.4926221Z # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype) 2026-02-21T12:53:33.4926375Z v_16 = tl.cast(v_15, tl.bfloat16) 2026-02-21T12:53:33.4926599Z tl.store(out + (indices_0[:, None, None] * 65536 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), v_16, None) 2026-02-21T12:53:33.4926785Z 2026-02-21T12:53:33.4926917Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T12:53:33.4927122Z """ 2026-02-21T12:53:33.4927214Z Computes scaled dot-product attention. 2026-02-21T12:53:33.4927298Z 2026-02-21T12:53:33.4927413Z Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V 2026-02-21T12:53:33.4927570Z 2026-02-21T12:53:33.4927604Z Args: 2026-02-21T12:53:33.4927708Z q_in: Query tensor of shape [..., seq_len_q, head_dim] 2026-02-21T12:53:33.4927918Z k_in: Key tensor of shape [..., seq_len_k, head_dim] 2026-02-21T12:53:33.4928075Z v_in: Value tensor of shape [..., seq_len_k, head_dim] 2026-02-21T12:53:33.4928178Z 2026-02-21T12:53:33.4928214Z Returns: 2026-02-21T12:53:33.4928322Z Output tensor of shape [..., seq_len_q, head_dim] 2026-02-21T12:53:33.4928450Z """ 2026-02-21T12:53:33.4928546Z # src[attention.py:56]: m_dim = q_in.size(-2) 2026-02-21T12:53:33.4928673Z m_dim = q_in.size(-2) 2026-02-21T12:53:33.4928789Z # src[attention.py:57]: n_dim = k_in.size(-2) 2026-02-21T12:53:33.4928915Z n_dim = k_in.size(-2) 2026-02-21T12:53:33.4929034Z # src[attention.py:58]: assert n_dim == v_in.size(-2) 2026-02-21T12:53:33.4929172Z assert n_dim == v_in.size(-2) 2026-02-21T12:53:33.4929318Z # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1)) 2026-02-21T12:53:33.4929468Z head_dim = 128 2026-02-21T12:53:33.4929603Z # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T12:53:33.4929784Z assert head_dim == k_in.size(-1) == v_in.size(-1) 
2026-02-21T12:53:33.4929955Z # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T12:53:33.4930123Z q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T12:53:33.4930285Z # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T12:53:33.4930450Z v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T12:53:33.4930632Z # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T12:53:33.4930837Z k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T12:53:33.4931023Z # src[attention.py:64]: out = torch.empty_like(q_view) 2026-02-21T12:53:33.4931163Z out = torch.empty_like(q_view) 2026-02-21T12:53:33.4931327Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T12:53:33.4931494Z _BLOCK_SIZE_1 = 128 2026-02-21T12:53:33.4931590Z _RDIM_SIZE_2 = 128 2026-02-21T12:53:33.4931737Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T12:53:33.4931989Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T12:53:33.4932203Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T12:53:33.4932380Z # src[attention.py:67-88]: ... 2026-02-21T12:53:33.4932695Z _launcher(_helion_attention, (192 * triton.cdiv(512, _BLOCK_SIZE_1),), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=1, waves_per_eu=2, matrix_instr_nonkdim=16) 2026-02-21T12:53:33.4933028Z # src[attention.py:89]: return out.view(q_in.size()) 2026-02-21T12:53:33.4933166Z return out.view(q_in.size()) 2026-02-21T12:53:34.3074750Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T12:53:34.3076967Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T12:53:34.3079124Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T12:53:34.3079603Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T12:53:34.3080033Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T12:53:34.3080368Z ------------------------------------------ 2026-02-21T12:53:34.3080683Z (4, 48, 512, 512, 128) 2026-02-21T12:53:34.3080845Z 2026-02-21T12:53:34.3085208Z 50%|█████ | 3/6 [26:40<30:43, 614.57s/it]WARNING:tritonbench.utils.triton_op:Running input ID 4: 2026-02-21T12:53:34.3085598Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T12:53:34.3086029Z ------------------------------------------ 2026-02-21T12:53:34.3086253Z (4, 48, 2048, 2048, 128) 2026-02-21T12:53:34.3087930Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten 2026-02-21T12:53:35.3065035Z INFO:tritonbench.utils.triton_op:Took 1.66ms to get benchmark function for flex_attention 2026-02-21T12:53:36.4652620Z WARNING:__main__:Input tensor metadata: 2026-02-21T12:53:36.4652912Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T12:53:36.4653161Z 'dtype': 'torch.bfloat16', 2026-02-21T12:53:36.4653399Z 'shape': (4, 48, 2048, 128), 2026-02-21T12:53:36.4653655Z 'stride': (12582912, 262144, 128, 1)}, 2026-02-21T12:53:36.4653899Z { 'device': 'cuda:0', 2026-02-21T12:53:36.4654123Z 'dtype': 'torch.bfloat16', 2026-02-21T12:53:36.4654351Z 'shape': (4, 48, 2048, 128), 
2026-02-21T12:53:36.4654585Z 'stride': (12582912, 262144, 128, 1)}, 2026-02-21T12:53:36.4654825Z { 'device': 'cuda:0', 2026-02-21T12:53:36.4655038Z 'dtype': 'torch.bfloat16', 2026-02-21T12:53:36.4655275Z 'shape': (4, 48, 2048, 128), 2026-02-21T12:53:36.4655509Z 'stride': (12582912, 262144, 128, 1)}), 2026-02-21T12:53:36.4655742Z 'kwargs': {}} 2026-02-21T12:53:36.4700452Z INFO:tritonbench.utils.triton_op:Took 5.19ms to get benchmark function for helion_attention 2026-02-21T12:53:36.7208569Z [0s] Autotune random seed: 2150287535 2026-02-21T12:53:36.7576062Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T12:54:11.3810466Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 64, 2048], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:12.3835759Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:54:12.7902591Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=16, 
pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:13.5050724Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:54:13.8357473Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:54:14.1845155Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:54:15.3142081Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 16, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, 
num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[2, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:54:16.3358463Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:17.1897909Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:17.3794886Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:54:17.9722667Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 1, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, 
num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, True], range_num_stages=[3, 1], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:54:18.2071950Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:54:20.0148678Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:20.4376469Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:54:20.6909244Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 32, 2048], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], 
matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:21.3984528Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:54:22.8463258Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[4, 2], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:23.1789074Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 512, 1024], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:54:25.7985886Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, 
num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:54:25.9733757Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 64, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T12:54:26.2027796Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:54:26.8904640Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:54:27.2914717Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, 
pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T12:54:27.8084474Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 4, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[0, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:54:28.1682734Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:54:28.1708097Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.2 configs/s 2026-02-21T12:55:19.5172168Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:63:24: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T12:55:19.5173220Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 2048], [262144, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T12:55:19.5174147Z ^ 2026-02-21T12:55:19.5175479Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:65:145: note: - use: %137 = "tt.reshape"(<>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 
8], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>> 2026-02-21T12:55:19.5176619Z 2026-02-21T12:55:19.5177281Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T12:55:19.5178172Z ^ 2026-02-21T12:55:19.5178533Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T12:55:19.5179027Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}> 2026-02-21T12:55:19.5179642Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}> 2026-02-21T12:55:19.5180155Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}> 2026-02-21T12:55:19.5180659Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}> 2026-02-21T12:55:19.5181279Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T12:55:19.5181762Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T12:55:19.5182231Z #blocked6 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}> 2026-02-21T12:55:19.5182687Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}> 2026-02-21T12:55:19.5183159Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T12:55:19.5183653Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], 
threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}> 2026-02-21T12:55:19.5184164Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}> 2026-02-21T12:55:19.5184690Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}> 2026-02-21T12:55:19.5185233Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}> 2026-02-21T12:55:19.5185748Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T12:55:19.5186267Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T12:55:19.5186784Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}> 2026-02-21T12:55:19.5187401Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T12:55:19.5188242Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T12:55:19.5188889Z %c2048_i64 = arith.constant 2048 : i64 2026-02-21T12:55:19.5189088Z %c128_i64 = arith.constant 128 : i64 2026-02-21T12:55:19.5189308Z %c192_i64 = arith.constant 192 : i64 2026-02-21T12:55:19.5189476Z %c0_i64 = arith.constant 0 : i64 2026-02-21T12:55:19.5189628Z %c262144_i64 = arith.constant 262144 : i64 2026-02-21T12:55:19.5189775Z %c128_i32 = arith.constant 128 : i32 2026-02-21T12:55:19.5189924Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T12:55:19.5190104Z %cst = arith.constant dense<128> : 
tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5190329Z %cst_0 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5190564Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<1x128x512xbf16, #blocked1> 2026-02-21T12:55:19.5190842Z %cst_2 = arith.constant dense<2048> : tensor<1x1x512xi64, #blocked1> 2026-02-21T12:55:19.5191058Z %cst_3 = arith.constant dense<0> : tensor<1x1x512xi64, #blocked1> 2026-02-21T12:55:19.5191277Z %cst_4 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:55:19.5191498Z %cst_5 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:55:19.5191718Z %cst_6 = arith.constant dense<128> : tensor<1x1x512xi64, #blocked1> 2026-02-21T12:55:19.5191896Z %c512_i32 = arith.constant 512 : i32 2026-02-21T12:55:19.5192043Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T12:55:19.5192193Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T12:55:19.5192342Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T12:55:19.5192493Z %c393216_i32 = arith.constant 393216 : i32 2026-02-21T12:55:19.5192670Z %cst_7 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked3> 2026-02-21T12:55:19.5192904Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5193168Z %cst_9 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5193414Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x512xf32, #blocked5> 2026-02-21T12:55:19.5193615Z %c0_i32 = arith.constant 0 : i32 2026-02-21T12:55:19.5193799Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked> 2026-02-21T12:55:19.5194044Z %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5194283Z %cst_13 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5194479Z %c64_i32 = arith.constant 64 : i32 2026-02-21T12:55:19.5194612Z %c192_i32 = arith.constant 192 : i32 2026-02-21T12:55:19.5194757Z 
%0 = tt.get_program_id x : i32 2026-02-21T12:55:19.5194954Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked6> 2026-02-21T12:55:19.5195282Z %2 = ttg.convert_layout %1 : tensor<128xi32, #blocked6> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:55:19.5195693Z %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T12:55:19.5196073Z %4 = ttg.convert_layout %3 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked8> 2026-02-21T12:55:19.5196424Z %5 = ttg.convert_layout %4 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T12:55:19.5196836Z %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9> 2026-02-21T12:55:19.5197212Z %7 = ttg.convert_layout %6 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked> 2026-02-21T12:55:19.5197499Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T12:55:19.5197753Z %9 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked6> 2026-02-21T12:55:19.5198018Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x512x!tt.ptr, #blocked1> 2026-02-21T12:55:19.5198282Z %11 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6> 2026-02-21T12:55:19.5198634Z %12 = ttg.convert_layout %11 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:55:19.5199056Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7> 2026-02-21T12:55:19.5199392Z %14 = ttg.convert_layout %13 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8> 2026-02-21T12:55:19.5199676Z %15 = ttg.convert_layout %14 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent 
= #blocked10}>> 2026-02-21T12:55:19.5200013Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi64, #blocked10> 2026-02-21T12:55:19.5200316Z %17 = ttg.convert_layout %16 : tensor<1x128x1xi64, #blocked10> -> tensor<1x128x1xi64, #blocked2> 2026-02-21T12:55:19.5200565Z %18 = tt.broadcast %17 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x512xi64, #blocked2> 2026-02-21T12:55:19.5200818Z %19 = ttg.convert_layout %18 : tensor<1x128x512xi64, #blocked2> -> tensor<1x128x512xi64, #blocked1> 2026-02-21T12:55:19.5201056Z %20 = arith.extsi %9 : tensor<512xi32, #blocked6> to tensor<512xi64, #blocked6> 2026-02-21T12:55:19.5201247Z %21 = arith.cmpi sge, %17, %cst_5 : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:55:19.5201422Z %22 = arith.cmpi slt, %17, %cst_4 : tensor<1x128x1xi64, #blocked2> 2026-02-21T12:55:19.5201591Z %23 = arith.andi %21, %22 : tensor<1x128x1xi1, #blocked2> 2026-02-21T12:55:19.5201790Z %24 = tt.broadcast %7 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked> 2026-02-21T12:55:19.5202041Z %25 = ttg.convert_layout %24 : tensor<1x512x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked11> 2026-02-21T12:55:19.5202299Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<1x512x128x!tt.ptr, #blocked11> 2026-02-21T12:55:19.5202514Z %27 = tt.splat %arg3 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T12:55:19.5202785Z %28 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6> 2026-02-21T12:55:19.5203052Z %29 = ttg.convert_layout %28 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:55:19.5203387Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7> 2026-02-21T12:55:19.5203674Z %31 = ttg.convert_layout %30 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8> 2026-02-21T12:55:19.5203961Z %32 = ttg.convert_layout %31 : 
tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T12:55:19.5204298Z %33 = tt.expand_dims %32 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9> 2026-02-21T12:55:19.5204626Z %34 = ttg.convert_layout %33 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5204844Z %35 = arith.cmpi sge, %34, %cst_0 : tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5205019Z %36 = arith.cmpi slt, %34, %cst : tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5205187Z %37 = arith.andi %35, %36 : tensor<1x1x128xi1, #blocked> 2026-02-21T12:55:19.5205347Z scf.for %arg4 = %0 to %c393216_i32 step %c2432_i32 : i32 { 2026-02-21T12:55:19.5205520Z %38 = arith.divsi %arg4, %c131072_i32 : i32 2026-02-21T12:55:19.5205649Z %39 = arith.muli %38, %c64_i32 : i32 2026-02-21T12:55:19.5205775Z %40 = arith.subi %c192_i32, %39 : i32 2026-02-21T12:55:19.5205898Z %41 = arith.minsi %40, %c64_i32 : i32 2026-02-21T12:55:19.5206021Z %42 = arith.remsi %arg4, %c131072_i32 : i32 2026-02-21T12:55:19.5206151Z %43 = arith.remsi %42, %41 : i32 2026-02-21T12:55:19.5206266Z %44 = arith.addi %39, %43 : i32 2026-02-21T12:55:19.5206383Z %45 = arith.divsi %42, %41 : i32 2026-02-21T12:55:19.5206504Z %46 = arith.muli %44, %c262144_i32 : i32 2026-02-21T12:55:19.5206628Z %47 = arith.muli %45, %c128_i32 : i32 2026-02-21T12:55:19.5206763Z %48 = arith.addi %46, %47 : i32 2026-02-21T12:55:19.5206902Z %49 = tt.splat %48 : i32 -> tensor<1x1x128xi32, #blocked> 2026-02-21T12:55:19.5207064Z %50 = arith.addi %49, %7 : tensor<1x1x128xi32, #blocked> 2026-02-21T12:55:19.5207264Z %51 = tt.addptr %8, %50 : tensor<1x1x128x!tt.ptr, #blocked>, tensor<1x1x128xi32, #blocked> 2026-02-21T12:55:19.5207478Z %52 = tt.load %51 : tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T12:55:19.5207620Z %53 = arith.extsi %44 : i32 to i64 2026-02-21T12:55:19.5207746Z %54 = arith.muli %53, %c262144_i64 : i64 
2026-02-21T12:55:19.5207888Z %55 = tt.splat %54 : i64 -> tensor<1x128x512xi64, #blocked1> 2026-02-21T12:55:19.5208038Z %56 = arith.cmpi sge, %53, %c0_i64 : i64 2026-02-21T12:55:19.5208169Z %57 = arith.cmpi slt, %53, %c192_i64 : i64 2026-02-21T12:55:19.5208294Z %58 = arith.andi %56, %57 : i1 2026-02-21T12:55:19.5208433Z %59 = tt.splat %58 : i1 -> tensor<1x128x1xi1, #blocked2> 2026-02-21T12:55:19.5208591Z %60 = arith.andi %59, %23 : tensor<1x128x1xi1, #blocked2> 2026-02-21T12:55:19.5208789Z %61 = tt.broadcast %60 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x512xi1, #blocked2> 2026-02-21T12:55:19.5209039Z %62 = ttg.convert_layout %61 : tensor<1x128x512xi1, #blocked2> -> tensor<1x128x512xi1, #blocked1> 2026-02-21T12:55:19.5209287Z %63 = tt.reshape %52 : tensor<1x1x128xbf16, #blocked> -> tensor<1x128xbf16, #blocked8> 2026-02-21T12:55:19.5209488Z %64 = tt.splat %46 : i32 -> tensor<1x512x1xi32, #blocked3> 2026-02-21T12:55:19.5209886Z %65:3 = scf.for %arg5 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg6 = %cst_13, %arg7 = %cst_12, %arg8 = %cst_11) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked>) : i32 { 2026-02-21T12:55:19.5210248Z %91 = tt.splat %arg5 : i32 -> tensor<512xi32, #blocked6> 2026-02-21T12:55:19.5210405Z %92 = arith.addi %91, %9 : tensor<512xi32, #blocked6> 2026-02-21T12:55:19.5210549Z %93 = arith.extsi %arg5 : i32 to i64 2026-02-21T12:55:19.5210689Z %94 = tt.splat %93 : i64 -> tensor<512xi64, #blocked6> 2026-02-21T12:55:19.5210839Z %95 = arith.addi %94, %20 : tensor<512xi64, #blocked6> 2026-02-21T12:55:19.5211079Z %96 = ttg.convert_layout %95 : tensor<512xi64, #blocked6> -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:55:19.5211413Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x512xi64, #blocked7> 2026-02-21T12:55:19.5211709Z %98 = ttg.convert_layout %97 : tensor<1x512xi64, #blocked7> -> tensor<1x512xi64, 
#blocked5> 2026-02-21T12:55:19.5212003Z %99 = ttg.convert_layout %98 : tensor<1x512xi64, #blocked5> -> tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> 2026-02-21T12:55:19.5212366Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x512xi64, #blocked12> 2026-02-21T12:55:19.5212690Z %101 = ttg.convert_layout %100 : tensor<1x1x512xi64, #blocked12> -> tensor<1x1x512xi64, #blocked1> 2026-02-21T12:55:19.5212915Z %102 = arith.muli %101, %cst_6 : tensor<1x1x512xi64, #blocked1> 2026-02-21T12:55:19.5213132Z %103 = tt.broadcast %102 : tensor<1x1x512xi64, #blocked1> -> tensor<1x128x512xi64, #blocked1> 2026-02-21T12:55:19.5213362Z %104 = arith.addi %19, %103 : tensor<1x128x512xi64, #blocked1> 2026-02-21T12:55:19.5213532Z %105 = arith.addi %55, %104 : tensor<1x128x512xi64, #blocked1> 2026-02-21T12:55:19.5213759Z %106 = tt.addptr %10, %105 : tensor<1x128x512x!tt.ptr, #blocked1>, tensor<1x128x512xi64, #blocked1> 2026-02-21T12:55:19.5213993Z %107 = arith.cmpi sge, %101, %cst_3 : tensor<1x1x512xi64, #blocked1> 2026-02-21T12:55:19.5214186Z %108 = arith.cmpi slt, %101, %cst_2 : tensor<1x1x512xi64, #blocked1> 2026-02-21T12:55:19.5214368Z %109 = arith.andi %107, %108 : tensor<1x1x512xi1, #blocked1> 2026-02-21T12:55:19.5214588Z %110 = tt.broadcast %109 : tensor<1x1x512xi1, #blocked1> -> tensor<1x128x512xi1, #blocked1> 2026-02-21T12:55:19.5214800Z %111 = arith.andi %62, %110 : tensor<1x128x512xi1, #blocked1> 2026-02-21T12:55:19.5214979Z %112 = tt.load %106, %111, %cst_1 : tensor<1x128x512x!tt.ptr, #blocked1> 2026-02-21T12:55:19.5215209Z %113 = tt.reshape %112 : tensor<1x128x512xbf16, #blocked1> -> tensor<128x512xbf16, #blocked5> 2026-02-21T12:55:19.5215516Z %114 = ttg.convert_layout %63 : tensor<1x128xbf16, #blocked8> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> 2026-02-21T12:55:19.5215876Z %115 = ttg.convert_layout %113 : tensor<128x512xbf16, #blocked5> -> tensor<128x512xbf16, 
#ttg.dot_op<{opIdx = 1, parent = #blocked5}>> 2026-02-21T12:55:19.5216195Z %116 = ttg.convert_layout %cst_10 : tensor<1x512xf32, #blocked5> -> tensor<1x512xf32, #blocked5> 2026-02-21T12:55:19.5216619Z %117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x512xf32, #blocked5> 2026-02-21T12:55:19.5217019Z %118 = tt.reshape %117 : tensor<1x512xf32, #blocked5> -> tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5217269Z %119 = arith.truncf %118 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1> 2026-02-21T12:55:19.5217516Z %120 = arith.extf %119 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5217713Z %121 = "tt.reduce"(%120) <{axis = 2 : i32}> ({ 2026-02-21T12:55:19.5217860Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:55:19.5217988Z %176 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T12:55:19.5218126Z tt.reduce.return %176 : f32 2026-02-21T12:55:19.5218318Z }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T12:55:19.5218617Z %122 = ttg.convert_layout %121 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5218891Z %123 = arith.truncf %122 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T12:55:19.5219121Z %124 = arith.extf %123 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5219319Z %125 = arith.mulf %124, %cst_9 : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5219513Z %126 = arith.truncf %125 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T12:55:19.5219738Z %127 = arith.extf %126 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5219935Z %128 = arith.cmpf ogt, %arg6, %127 : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5230136Z %129 = 
arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5230308Z %130 = arith.ori %128, %129 : tensor<1x1xi1, #blocked4> 2026-02-21T12:55:19.5230514Z %131 = arith.select %130, %arg6, %127 : tensor<1x1xi1, #blocked4>, tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5230730Z %132 = arith.mulf %120, %cst_8 : tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5230939Z %133 = arith.truncf %132 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1> 2026-02-21T12:55:19.5231260Z %134 = ttg.convert_layout %131 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> 2026-02-21T12:55:19.5231603Z %135 = tt.expand_dims %134 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13> 2026-02-21T12:55:19.5231912Z %136 = ttg.convert_layout %135 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T12:55:19.5232167Z %137 = arith.extf %133 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5232425Z %138 = tt.broadcast %136 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x512xf32, #blocked14> 2026-02-21T12:55:19.5232684Z %139 = ttg.convert_layout %138 : tensor<1x1x512xf32, #blocked14> -> tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5232902Z %140 = arith.subf %137, %139 : tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5233215Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1x512xf32, #blocked1> 2026-02-21T12:55:19.5233512Z %142 = "tt.reduce"(%141) <{axis = 2 : i32}> ({ 2026-02-21T12:55:19.5233642Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T12:55:19.5233768Z %176 = arith.addf %arg9, %arg10 : f32 2026-02-21T12:55:19.5233890Z tt.reduce.return %176 : f32 2026-02-21T12:55:19.5234086Z }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 
2026-02-21T12:55:19.5234384Z %143 = ttg.convert_layout %142 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5234630Z %144 = arith.subf %arg6, %131 : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5234923Z %145 = tt.extern_elementwise %144 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5235216Z %146 = arith.mulf %arg7, %145 : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5235380Z %147 = arith.addf %146, %143 : tensor<1x1xf32, #blocked4> 2026-02-21T12:55:19.5235645Z %148 = ttg.convert_layout %145 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> 2026-02-21T12:55:19.5235984Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13> 2026-02-21T12:55:19.5236294Z %150 = ttg.convert_layout %149 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T12:55:19.5236547Z %151 = tt.broadcast %150 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14> 2026-02-21T12:55:19.5236804Z %152 = ttg.convert_layout %151 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked> 2026-02-21T12:55:19.5237025Z %153 = arith.mulf %arg8, %152 : tensor<1x1x128xf32, #blocked> 2026-02-21T12:55:19.5237271Z %154 = ttg.convert_layout %92 : tensor<512xi32, #blocked6> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T12:55:19.5237609Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x512xi32, #blocked7> 2026-02-21T12:55:19.5237910Z %156 = ttg.convert_layout %155 : tensor<1x512xi32, #blocked7> -> tensor<1x512xi32, #blocked5> 2026-02-21T12:55:19.5238220Z %157 = ttg.convert_layout %156 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> 
2026-02-21T12:55:19.5238573Z %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x512x1xi32, #blocked15> 2026-02-21T12:55:19.5238889Z %159 = ttg.convert_layout %158 : tensor<1x512x1xi32, #blocked15> -> tensor<1x512x1xi32, #blocked3> 2026-02-21T12:55:19.5239129Z %160 = arith.muli %159, %cst_7 : tensor<1x512x1xi32, #blocked3> 2026-02-21T12:55:19.5239298Z %161 = arith.addi %64, %160 : tensor<1x512x1xi32, #blocked3> 2026-02-21T12:55:19.5239508Z %162 = tt.broadcast %161 : tensor<1x512x1xi32, #blocked3> -> tensor<1x512x128xi32, #blocked3> 2026-02-21T12:55:19.5239777Z %163 = ttg.convert_layout %162 : tensor<1x512x128xi32, #blocked3> -> tensor<1x512x128xi32, #blocked11> 2026-02-21T12:55:19.5240002Z %164 = arith.addi %163, %25 : tensor<1x512x128xi32, #blocked11> 2026-02-21T12:55:19.5240236Z %165 = tt.addptr %26, %164 : tensor<1x512x128x!tt.ptr, #blocked11>, tensor<1x512x128xi32, #blocked11> 2026-02-21T12:55:19.5240482Z %166 = tt.load %165 : tensor<1x512x128x!tt.ptr, #blocked11> 2026-02-21T12:55:19.5240698Z %167 = arith.truncf %141 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1> 2026-02-21T12:55:19.5240943Z %168 = tt.reshape %153 : tensor<1x1x128xf32, #blocked> -> tensor<1x128xf32, #blocked8> 2026-02-21T12:55:19.5241179Z %169 = tt.reshape %167 : tensor<1x1x512xbf16, #blocked1> -> tensor<1x512xbf16, #blocked5> 2026-02-21T12:55:19.5241428Z %170 = tt.reshape %166 : tensor<1x512x128xbf16, #blocked11> -> tensor<512x128xbf16, #blocked8> 2026-02-21T12:55:19.5241732Z %171 = ttg.convert_layout %169 : tensor<1x512xbf16, #blocked5> -> tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T12:55:19.5242094Z %172 = ttg.convert_layout %170 : tensor<512x128xbf16, #blocked8> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T12:55:19.5242406Z %173 = ttg.convert_layout %168 : tensor<1x128xf32, #blocked8> -> tensor<1x128xf32, #blocked8> 
2026-02-21T12:55:19.5242850Z %174 = tt.dot %171, %172, %173, inputPrecision = tf32 : tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x128xf32, #blocked8> 2026-02-21T12:55:19.5243250Z %175 = tt.reshape %174 : tensor<1x128xf32, #blocked8> -> tensor<1x1x128xf32, #blocked> 2026-02-21T12:55:19.5243523Z scf.yield %131, %147, %175 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked> 2026-02-21T12:55:19.5243745Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T12:55:19.5243998Z %66 = ttg.convert_layout %65#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> 2026-02-21T12:55:19.5244333Z %67 = tt.expand_dims %66 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13> 2026-02-21T12:55:19.5244632Z %68 = ttg.convert_layout %67 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T12:55:19.5244881Z %69 = tt.broadcast %68 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14> 2026-02-21T12:55:19.5245127Z %70 = ttg.convert_layout %69 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked> 2026-02-21T12:55:19.5245342Z %71 = arith.divf %65#2, %70 : tensor<1x1x128xf32, #blocked> 2026-02-21T12:55:19.5245539Z %72 = arith.truncf %71 : tensor<1x1x128xf32, #blocked> to tensor<1x1x128xbf16, #blocked> 2026-02-21T12:55:19.5245722Z %73 = arith.extsi %44 : i32 to i64 2026-02-21T12:55:19.5245844Z %74 = arith.extsi %45 : i32 to i64 2026-02-21T12:55:19.5245964Z %75 = arith.muli %73, %c262144_i64 : i64 2026-02-21T12:55:19.5246108Z %76 = tt.splat %75 : i64 -> tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5246267Z %77 = arith.muli %74, %c128_i64 : i64 2026-02-21T12:55:19.5246405Z %78 = tt.splat %77 : i64 -> tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5246559Z %79 = arith.addi %78, %34 : tensor<1x1x128xi64, #blocked> 
2026-02-21T12:55:19.5246714Z %80 = arith.addi %76, %79 : tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5246914Z %81 = tt.addptr %27, %80 : tensor<1x1x128x!tt.ptr, #blocked>, tensor<1x1x128xi64, #blocked> 2026-02-21T12:55:19.5247122Z %82 = arith.cmpi sge, %73, %c0_i64 : i64 2026-02-21T12:55:19.5247249Z %83 = arith.cmpi slt, %73, %c192_i64 : i64 2026-02-21T12:55:19.5247367Z %84 = arith.andi %82, %83 : i1 2026-02-21T12:55:19.5247483Z %85 = arith.cmpi sge, %74, %c0_i64 : i64 2026-02-21T12:55:19.5247606Z %86 = arith.cmpi slt, %74, %c2048_i64 : i64 2026-02-21T12:55:19.5247726Z %87 = arith.andi %85, %86 : i1 2026-02-21T12:55:19.5247832Z %88 = arith.andi %84, %87 : i1 2026-02-21T12:55:19.5247965Z %89 = tt.splat %88 : i1 -> tensor<1x1x128xi1, #blocked> 2026-02-21T12:55:19.5248119Z %90 = arith.andi %89, %37 : tensor<1x1x128xi1, #blocked> 2026-02-21T12:55:19.5248303Z tt.store %81, %72, %90 : tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T12:55:19.5248475Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32} 2026-02-21T12:55:19.5248608Z tt.return 2026-02-21T12:55:19.5248689Z } 2026-02-21T12:55:19.5248766Z } 2026-02-21T12:55:19.5248810Z 2026-02-21T12:55:19.5248840Z {-# 2026-02-21T12:55:19.5248919Z external_resources: { 2026-02-21T12:55:19.5249020Z mlir_reproducer: { 2026-02-21T12:55:19.5251187Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=1 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T12:55:19.5253400Z disable_threading: false, 2026-02-21T12:55:19.5253510Z verify_each: true 2026-02-21T12:55:19.5253598Z } 2026-02-21T12:55:19.5253672Z } 2026-02-21T12:55:19.5253741Z #-} 2026-02-21T12:55:19.5254015Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T12:55:19.5254719Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T12:55:19.5255268Z [102s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T12:55:19.5256058Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T12:55:19.5256801Z Error: RuntimeError: PassManager::run failed 2026-02-21T12:55:19.5256966Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T12:55:22.2039421Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 3.2 configs/s 2026-02-21T12:55:22.2050302Z [105s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s]) 2026-02-21T12:55:22.2321826Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 - configs/s 2026-02-21T12:55:23.2028753Z [106s] Initial random population of 100, 5 starting points: 2026-02-21T12:55:23.2029252Z error=12 2026-02-21T12:55:23.2029485Z timeout=25 2026-02-21T12:55:23.2029696Z ok=63 2026-02-21T12:55:23.2029919Z min=2.1912 2026-02-21T12:55:23.2030117Z mid=12.2203 2026-02-21T12:55:23.2030327Z max=1458.4796 2026-02-21T12:55:23.2030564Z best={'block_sizes': [1, 256, 16], 2026-02-21T12:55:23.2031248Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:55:23.2031639Z 'l2_groupings': [2], 2026-02-21T12:55:23.2031921Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:55:23.2032237Z 'loop_orders': [[1, 0]], 2026-02-21T12:55:23.2032511Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:55:23.2032801Z 'num_sm_multiplier': 32, 2026-02-21T12:55:23.2033062Z 'num_stages': 2, 2026-02-21T12:55:23.2033286Z 'num_warps': 16, 2026-02-21T12:55:23.2033543Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:55:23.2033875Z 'range_flattens': 
[False, None], 2026-02-21T12:55:23.2034178Z 'range_multi_buffers': [False, True], 2026-02-21T12:55:23.2034487Z 'range_num_stages': [2, 4], 2026-02-21T12:55:23.2034905Z 'range_unroll_factors': [2, 4], 2026-02-21T12:55:23.2035208Z 'range_warp_specializes': [], 2026-02-21T12:55:23.2035482Z 'waves_per_eu': 4} 2026-02-21T12:55:23.2045507Z [106s] Fitting surrogate: 100 points, 100 targets 2026-02-21T12:55:24.1407214Z [107s] Generation 1 starting: 91 neighbors, 5 active search path(s) 2026-02-21T12:55:47.3197638Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 3.3 configs/s 2026-02-21T12:55:57.1182517Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 92/92 9.5 configs/s 2026-02-21T12:55:57.4135776Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 103/103 223.8 configs/s 2026-02-21T12:56:05.1620702Z [148s] Generation 1 complete: 2026-02-21T12:56:05.1621096Z error=3 2026-02-21T12:56:05.1621301Z ok=93 2026-02-21T12:56:05.1621508Z min=1.8662 2026-02-21T12:56:05.1621718Z mid=3.5739 2026-02-21T12:56:05.1621918Z max=59.7261 2026-02-21T12:56:05.1622154Z best={'block_sizes': [1, 128, 16], 2026-02-21T12:56:05.1622604Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T12:56:05.1623016Z 'l2_groupings': [2], 2026-02-21T12:56:05.1623318Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:56:05.1623633Z 'loop_orders': [[1, 0]], 2026-02-21T12:56:05.1623888Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:56:05.1624140Z 'num_stages': 2, 2026-02-21T12:56:05.1624369Z 'num_warps': 8, 2026-02-21T12:56:05.1624585Z 'pid_type': 'flat', 2026-02-21T12:56:05.1624831Z 'range_flattens': [None, None], 2026-02-21T12:56:05.1625108Z 'range_multi_buffers': [None, True], 2026-02-21T12:56:05.1625394Z 'range_num_stages': [0, 4], 2026-02-21T12:56:05.1625656Z 'range_unroll_factors': [0, 4], 2026-02-21T12:56:05.1625928Z 'range_warp_specializes': [], 2026-02-21T12:56:05.1626179Z 'waves_per_eu': 4} 2026-02-21T12:56:05.1686989Z [148s] Fitting surrogate: 196 
points, 196 targets 2026-02-21T12:56:06.9566851Z [150s] Generation 2 starting: 87 neighbors, 5 active search path(s) 2026-02-21T12:56:46.3356644Z [189s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:56:49.2656251Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.7 configs/s 2026-02-21T12:56:57.2161914Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 11.4 configs/s 2026-02-21T12:57:01.5053412Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 116/116 16.4 configs/s 2026-02-21T12:57:10.1456588Z [213s] Generation 2 complete: 2026-02-21T12:57:10.1456971Z error=6 2026-02-21T12:57:10.1457286Z timeout=1 2026-02-21T12:57:10.1457489Z ok=86 2026-02-21T12:57:10.1457685Z min=1.6856 2026-02-21T12:57:10.1457978Z mid=2.3992 2026-02-21T12:57:10.1458200Z max=31.4057 2026-02-21T12:57:10.1458463Z best={'block_sizes': [1, 256, 32], 2026-02-21T12:57:10.1458938Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T12:57:10.1459375Z 'l2_groupings': [8], 2026-02-21T12:57:10.1459650Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:57:10.1460317Z 'loop_orders': [[0, 1]], 2026-02-21T12:57:10.1460601Z 'matrix_instr_nonkdim': 16, 2026-02-21T12:57:10.1460873Z 'num_sm_multiplier': 64, 2026-02-21T12:57:10.1461131Z 'num_stages': 2, 2026-02-21T12:57:10.1461357Z 'num_warps': 8, 2026-02-21T12:57:10.1461626Z 'pid_type': 'persistent_blocked', 2026-02-21T12:57:10.1461928Z 'range_flattens': [False, None], 2026-02-21T12:57:10.1462231Z 'range_multi_buffers': [True, False], 2026-02-21T12:57:10.1462534Z 
'range_num_stages': [1, 1], 2026-02-21T12:57:10.1462806Z 'range_unroll_factors': [2, 0], 2026-02-21T12:57:10.1463098Z 'range_warp_specializes': [], 2026-02-21T12:57:10.1463366Z 'waves_per_eu': 1} 2026-02-21T12:57:10.1578964Z [213s] Fitting surrogate: 289 points, 289 targets 2026-02-21T12:57:11.1272356Z [214s] Generation 3 starting: 89 neighbors, 5 active search path(s) 2026-02-21T12:57:49.4746137Z [252s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:57:50.1770128Z [253s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:57:50.1796579Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.8 configs/s 2026-02-21T12:57:57.7009487Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 11.9 configs/s 2026-02-21T12:58:01.6785791Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 136/136 19.8 configs/s 2026-02-21T12:58:08.9558591Z [272s] Generation 3 complete: 2026-02-21T12:58:08.9558941Z error=2 2026-02-21T12:58:08.9559141Z timeout=2 2026-02-21T12:58:08.9561710Z ok=90 2026-02-21T12:58:08.9561943Z min=1.4115 2026-02-21T12:58:08.9562447Z mid=2.4051 2026-02-21T12:58:08.9562796Z max=33.5141 
2026-02-21T12:58:08.9563050Z best={'block_sizes': [1, 128, 64], 2026-02-21T12:58:08.9563488Z 'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'], 2026-02-21T12:58:08.9575325Z 'l2_groupings': [32], 2026-02-21T12:58:08.9575588Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:58:08.9575838Z 'loop_orders': [[0, 1]], 2026-02-21T12:58:08.9576070Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:58:08.9576564Z 'num_sm_multiplier': 16, 2026-02-21T12:58:08.9576766Z 'num_stages': 1, 2026-02-21T12:58:08.9576943Z 'num_warps': 4, 2026-02-21T12:58:08.9577153Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:58:08.9577421Z 'range_flattens': [None, True], 2026-02-21T12:58:08.9577652Z 'range_multi_buffers': [False, True], 2026-02-21T12:58:08.9577892Z 'range_num_stages': [0, 0], 2026-02-21T12:58:08.9578104Z 'range_unroll_factors': [2, 2], 2026-02-21T12:58:08.9578341Z 'range_warp_specializes': [], 2026-02-21T12:58:08.9578555Z 'waves_per_eu': 2} 2026-02-21T12:58:08.9625382Z [272s] Fitting surrogate: 383 points, 383 targets 2026-02-21T12:58:09.9470180Z [273s] Generation 4 starting: 98 neighbors, 5 active search path(s) 2026-02-21T12:58:42.4272084Z [305s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T12:58:42.4296147Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 1.6 configs/s 2026-02-21T12:58:50.7290760Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 11.9 configs/s 2026-02-21T12:58:54.7380260Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 146/146 20.6 configs/s 2026-02-21T12:59:01.0662361Z [324s] 
Generation 4 complete: 2026-02-21T12:59:01.0662828Z error=9 2026-02-21T12:59:01.0663053Z timeout=1 2026-02-21T12:59:01.0663266Z ok=93 2026-02-21T12:59:01.0663470Z min=1.2779 2026-02-21T12:59:01.0663782Z mid=2.3124 2026-02-21T12:59:01.0664107Z max=35.2354 2026-02-21T12:59:01.0665123Z best={'block_sizes': [1, 128, 128], 2026-02-21T12:59:01.0665593Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T12:59:01.0666034Z 'l2_groupings': [32], 2026-02-21T12:59:01.0666310Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:59:01.0666634Z 'loop_orders': [[0, 1]], 2026-02-21T12:59:01.0666908Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:59:01.0667202Z 'num_sm_multiplier': 16, 2026-02-21T12:59:01.0667460Z 'num_stages': 1, 2026-02-21T12:59:01.0667691Z 'num_warps': 4, 2026-02-21T12:59:01.0667951Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:59:01.0668285Z 'range_flattens': [None, True], 2026-02-21T12:59:01.0668601Z 'range_multi_buffers': [False, True], 2026-02-21T12:59:01.0668908Z 'range_num_stages': [0, 0], 2026-02-21T12:59:01.0669190Z 'range_unroll_factors': [2, 2], 2026-02-21T12:59:01.0669482Z 'range_warp_specializes': [], 2026-02-21T12:59:01.0669767Z 'waves_per_eu': 2} 2026-02-21T12:59:01.0752709Z [324s] Fitting surrogate: 486 points, 486 targets 2026-02-21T12:59:02.7439669Z [325s] Generation 5 starting: 73 neighbors, 4 active search path(s) 2026-02-21T12:59:39.4107155Z [362s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T12:59:42.5823430Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.6 configs/s 
2026-02-21T12:59:48.1121775Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 13.8 configs/s 2026-02-21T12:59:53.2999316Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 19.4 configs/s 2026-02-21T12:59:59.6076155Z [382s] Generation 5 complete: 2026-02-21T12:59:59.6079576Z error=3 2026-02-21T12:59:59.6079887Z timeout=1 2026-02-21T12:59:59.6080067Z ok=73 2026-02-21T12:59:59.6080239Z min=1.3197 2026-02-21T12:59:59.6080420Z mid=1.8694 2026-02-21T12:59:59.6080907Z max=14.8615 2026-02-21T12:59:59.6081115Z best={'block_sizes': [1, 128, 128], 2026-02-21T12:59:59.6081494Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T12:59:59.6081864Z 'l2_groupings': [32], 2026-02-21T12:59:59.6082109Z 'load_eviction_policies': ['', '', ''], 2026-02-21T12:59:59.6082386Z 'loop_orders': [[0, 1]], 2026-02-21T12:59:59.6082711Z 'matrix_instr_nonkdim': 0, 2026-02-21T12:59:59.6082954Z 'num_sm_multiplier': 16, 2026-02-21T12:59:59.6083191Z 'num_stages': 1, 2026-02-21T12:59:59.6083389Z 'num_warps': 4, 2026-02-21T12:59:59.6083618Z 'pid_type': 'persistent_interleaved', 2026-02-21T12:59:59.6083898Z 'range_flattens': [None, False], 2026-02-21T12:59:59.6084163Z 'range_multi_buffers': [False, True], 2026-02-21T12:59:59.6084431Z 'range_num_stages': [0, 0], 2026-02-21T12:59:59.6084678Z 'range_unroll_factors': [2, 2], 2026-02-21T12:59:59.6084946Z 'range_warp_specializes': [], 2026-02-21T12:59:59.6085173Z 'waves_per_eu': 2} 2026-02-21T12:59:59.6195400Z [382s] Fitting surrogate: 563 points, 563 targets 2026-02-21T13:00:00.4612495Z [383s] Generation 6 starting: 69 neighbors, 4 active search path(s) 2026-02-21T13:00:38.6275748Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.4 configs/s 2026-02-21T13:00:44.7417866Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 11.5 configs/s 2026-02-21T13:00:49.1498456Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 21.5 configs/s 2026-02-21T13:00:54.7334375Z 
[437s] Generation 6 complete: 2026-02-21T13:00:54.7334790Z error=1 2026-02-21T13:00:54.7335008Z ok=72 2026-02-21T13:00:54.7335209Z min=1.3089 2026-02-21T13:00:54.7335415Z mid=1.8582 2026-02-21T13:00:54.7335610Z max=17.2142 2026-02-21T13:00:54.7335844Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:00:54.7336262Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:00:54.7337251Z 'l2_groupings': [16], 2026-02-21T13:00:54.7337543Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:00:54.7337881Z 'loop_orders': [[0, 1]], 2026-02-21T13:00:54.7338135Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:00:54.7338378Z 'num_sm_multiplier': 16, 2026-02-21T13:00:54.7338611Z 'num_stages': 1, 2026-02-21T13:00:54.7338830Z 'num_warps': 4, 2026-02-21T13:00:54.7339063Z 'pid_type': 'persistent_interleaved', 2026-02-21T13:00:54.7339350Z 'range_flattens': [None, False], 2026-02-21T13:00:54.7339623Z 'range_multi_buffers': [False, True], 2026-02-21T13:00:54.7339895Z 'range_num_stages': [0, 0], 2026-02-21T13:00:54.7340155Z 'range_unroll_factors': [2, 2], 2026-02-21T13:00:54.7340419Z 'range_warp_specializes': [], 2026-02-21T13:00:54.7340659Z 'waves_per_eu': 2} 2026-02-21T13:00:54.7428033Z [437s] Fitting surrogate: 636 points, 636 targets 2026-02-21T13:00:55.5858425Z [438s] Generation 7 starting: 72 neighbors, 4 active search path(s) 2026-02-21T13:01:33.3355248Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.7 configs/s 2026-02-21T13:01:39.3772117Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 12.3 configs/s 2026-02-21T13:01:41.1886259Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 40.3 configs/s 2026-02-21T13:01:46.4675176Z [489s] Generation 7 complete: 2026-02-21T13:01:46.4675606Z error=4 2026-02-21T13:01:46.4675813Z ok=72 2026-02-21T13:01:46.4676018Z min=1.3046 2026-02-21T13:01:46.4676231Z mid=2.9547 2026-02-21T13:01:46.4676429Z max=15.7700 2026-02-21T13:01:46.4676663Z best={'block_sizes': [1, 128, 128], 
2026-02-21T13:01:46.4677557Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:01:46.4677950Z 'l2_groupings': [32], 2026-02-21T13:01:46.4678232Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:01:46.4678543Z 'loop_orders': [[0, 1]], 2026-02-21T13:01:46.4678814Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:01:46.4679090Z 'num_sm_multiplier': 16, 2026-02-21T13:01:46.4679362Z 'num_stages': 1, 2026-02-21T13:01:46.4679588Z 'num_warps': 4, 2026-02-21T13:01:46.4679849Z 'pid_type': 'persistent_interleaved', 2026-02-21T13:01:46.4680335Z 'range_flattens': [None, False], 2026-02-21T13:01:46.4680638Z 'range_multi_buffers': [False, False], 2026-02-21T13:01:46.4680879Z 'range_num_stages': [0, 0], 2026-02-21T13:01:46.4681198Z 'range_unroll_factors': [2, 2], 2026-02-21T13:01:46.4681453Z 'range_warp_specializes': [], 2026-02-21T13:01:46.4681685Z 'waves_per_eu': 2} 2026-02-21T13:01:46.4741248Z [489s] Fitting surrogate: 712 points, 712 targets 2026-02-21T13:01:47.2916362Z [490s] Generation 8 starting: 71 neighbors, 4 active search path(s) 2026-02-21T13:02:26.9471355Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.5 configs/s 2026-02-21T13:02:33.0444255Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 12.1 configs/s 2026-02-21T13:02:36.5531818Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 26.3 configs/s 2026-02-21T13:02:42.4930744Z [545s] Generation 8 complete: 2026-02-21T13:02:42.4931120Z error=3 2026-02-21T13:02:42.4931322Z ok=72 2026-02-21T13:02:42.4931550Z min=1.3343 2026-02-21T13:02:42.4931750Z mid=1.9467 2026-02-21T13:02:42.4931948Z max=27.7946 2026-02-21T13:02:42.4932178Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:02:42.4932619Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:02:42.4933019Z 'l2_groupings': [32], 2026-02-21T13:02:42.4933297Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:02:42.4933606Z 'loop_orders': [[0, 1]], 
2026-02-21T13:02:42.4933877Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:02:42.4934170Z 'num_sm_multiplier': 16, 2026-02-21T13:02:42.4934425Z 'num_stages': 1, 2026-02-21T13:02:42.4934659Z 'num_warps': 4, 2026-02-21T13:02:42.4934918Z 'pid_type': 'persistent_interleaved', 2026-02-21T13:02:42.4935223Z 'range_flattens': [None, True], 2026-02-21T13:02:42.4935427Z 'range_multi_buffers': [False, False], 2026-02-21T13:02:42.4935614Z 'range_num_stages': [0, 0], 2026-02-21T13:02:42.4935732Z 'range_unroll_factors': [2, 2], 2026-02-21T13:02:42.4935851Z 'range_warp_specializes': [], 2026-02-21T13:02:42.4935967Z 'waves_per_eu': 2} 2026-02-21T13:02:42.5014326Z [545s] Fitting surrogate: 787 points, 787 targets 2026-02-21T13:02:43.9428163Z [547s] Generation 9 starting: 50 neighbors, 3 active search path(s) 2026-02-21T13:03:15.5639695Z [578s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:03:23.3874750Z [586s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:03:24.5797079Z [587s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], 
load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:03:24.5818032Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.4 configs/s 2026-02-21T13:03:28.0618631Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 14.8 configs/s 2026-02-21T13:03:30.5355412Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 30.1 configs/s 2026-02-21T13:03:35.6671333Z [598s] Generation 9 complete: 2026-02-21T13:03:35.6671749Z error=2 2026-02-21T13:03:35.6671955Z timeout=3 2026-02-21T13:03:35.6672153Z ok=48 2026-02-21T13:03:35.6672356Z min=1.3454 2026-02-21T13:03:35.6673078Z mid=1.6914 2026-02-21T13:03:35.6673274Z max=8.7120 2026-02-21T13:03:35.6673498Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:03:35.6673913Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:03:35.6674314Z 'l2_groupings': [32], 2026-02-21T13:03:35.6674605Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:03:35.6674922Z 'loop_orders': [[0, 1]], 2026-02-21T13:03:35.6675203Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:03:35.6675478Z 'num_sm_multiplier': 16, 2026-02-21T13:03:35.6675738Z 'num_stages': 1, 2026-02-21T13:03:35.6675969Z 'num_warps': 4, 2026-02-21T13:03:35.6676227Z 'pid_type': 'persistent_blocked', 2026-02-21T13:03:35.6676690Z 'range_flattens': [None, True], 2026-02-21T13:03:35.6676998Z 'range_multi_buffers': [False, False], 2026-02-21T13:03:35.6677321Z 'range_num_stages': [1, 0], 2026-02-21T13:03:35.6677597Z 'range_unroll_factors': [2, 2], 2026-02-21T13:03:35.6677824Z 'range_warp_specializes': [], 2026-02-21T13:03:35.6678037Z 'waves_per_eu': 2} 2026-02-21T13:03:35.6763072Z [598s] Fitting surrogate: 840 points, 840 targets 
2026-02-21T13:03:36.3111629Z [599s] Generation 10 starting: 54 neighbors, 3 active search path(s) 2026-02-21T13:04:17.2560734Z [640s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:04:17.2577582Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 0.4 configs/s 2026-02-21T13:04:21.9837770Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 55/55 11.7 configs/s 2026-02-21T13:04:22.1546602Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━ 156/156 1240.0 2026-02-21T13:04:22.1548596Z configs/s 2026-02-21T13:04:27.5974250Z [650s] Generation 10 complete: 2026-02-21T13:04:27.5974675Z error=3 2026-02-21T13:04:27.5974896Z timeout=1 2026-02-21T13:04:27.5975476Z ok=53 2026-02-21T13:04:27.5975671Z min=1.2608 2026-02-21T13:04:27.5975868Z mid=2.3358 2026-02-21T13:04:27.5976064Z max=32.6283 2026-02-21T13:04:27.5976288Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:04:27.5976699Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:04:27.5977089Z 'l2_groupings': [32], 2026-02-21T13:04:27.5977363Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:04:27.5977682Z 'loop_orders': [[0, 1]], 2026-02-21T13:04:27.5977951Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:04:27.5978512Z 'num_sm_multiplier': 16, 2026-02-21T13:04:27.5978767Z 'num_stages': 1, 2026-02-21T13:04:27.5978993Z 'num_warps': 4, 2026-02-21T13:04:27.5979252Z 'pid_type': 'persistent_blocked', 2026-02-21T13:04:27.5979571Z 'range_flattens': [None, True], 2026-02-21T13:04:27.5979873Z 'range_multi_buffers': [False, False], 2026-02-21T13:04:27.5980179Z 
'range_num_stages': [1, 0], 2026-02-21T13:04:27.5980442Z 'range_unroll_factors': [2, 2], 2026-02-21T13:04:27.5980658Z 'range_warp_specializes': [], 2026-02-21T13:04:27.5980881Z 'waves_per_eu': 2} 2026-02-21T13:04:27.6041218Z [650s] Fitting surrogate: 897 points, 897 targets 2026-02-21T13:04:28.0740638Z [651s] Generation 11 starting: 39 neighbors, 2 active search path(s) 2026-02-21T13:04:42.6914235Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 1.8 configs/s 2026-02-21T13:04:45.8331031Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 13.0 configs/s 2026-02-21T13:04:45.9818940Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 158/158 1448.4 2026-02-21T13:04:45.9819488Z configs/s 2026-02-21T13:04:50.6673922Z [673s] Generation 11 complete: 2026-02-21T13:04:50.6674533Z ok=41 2026-02-21T13:04:50.6674618Z min=1.3124 2026-02-21T13:04:50.6674699Z mid=2.5480 2026-02-21T13:04:50.6674772Z max=8.9074 2026-02-21T13:04:50.6674863Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:04:50.6675015Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:04:50.6675175Z 'l2_groupings': [32], 2026-02-21T13:04:50.6675276Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:04:50.6675394Z 'loop_orders': [[0, 1]], 2026-02-21T13:04:50.6675495Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:04:50.6675603Z 'num_sm_multiplier': 16, 2026-02-21T13:04:50.6675701Z 'num_stages': 1, 2026-02-21T13:04:50.6675788Z 'num_warps': 4, 2026-02-21T13:04:50.6675996Z 'pid_type': 'persistent_blocked', 2026-02-21T13:04:50.6746413Z 'range_flattens': [False, True], 2026-02-21T13:04:50.6746563Z 'range_multi_buffers': [False, False], 2026-02-21T13:04:50.6746683Z 'range_num_stages': [1, 0], 2026-02-21T13:04:50.6746791Z 'range_unroll_factors': [2, 2], 2026-02-21T13:04:50.6746904Z 'range_warp_specializes': [], 2026-02-21T13:04:50.6747008Z 'waves_per_eu': 2} 2026-02-21T13:04:50.6747117Z [673s] Fitting surrogate: 938 points, 938 targets 
2026-02-21T13:04:51.0824411Z [674s] Generation 12 starting: 34 neighbors, 2 active search path(s) 2026-02-21T13:05:22.0534741Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 0.6 configs/s 2026-02-21T13:05:25.0217479Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 11.9 configs/s 2026-02-21T13:05:25.1744209Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━ 158/158 1404.4 2026-02-21T13:05:25.1748116Z configs/s 2026-02-21T13:05:29.9513773Z [713s] Generation 12 complete: 2026-02-21T13:05:29.9513991Z error=1 2026-02-21T13:05:29.9514117Z ok=35 2026-02-21T13:05:29.9514202Z min=1.2955 2026-02-21T13:05:29.9514293Z mid=1.7441 2026-02-21T13:05:29.9514378Z max=18.3117 2026-02-21T13:05:29.9514471Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:05:29.9514640Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:05:29.9514800Z 'l2_groupings': [64], 2026-02-21T13:05:29.9514912Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:05:29.9515029Z 'loop_orders': [[0, 1]], 2026-02-21T13:05:29.9515528Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:05:29.9515640Z 'num_sm_multiplier': 16, 2026-02-21T13:05:29.9515746Z 'num_stages': 1, 2026-02-21T13:05:29.9515836Z 'num_warps': 4, 2026-02-21T13:05:29.9515940Z 'pid_type': 'persistent_blocked', 2026-02-21T13:05:29.9516064Z 'range_flattens': [False, True], 2026-02-21T13:05:29.9516180Z 'range_multi_buffers': [False, None], 2026-02-21T13:05:29.9516307Z 'range_num_stages': [1, 0], 2026-02-21T13:05:29.9516416Z 'range_unroll_factors': [2, 2], 2026-02-21T13:05:29.9516645Z 'range_warp_specializes': [], 2026-02-21T13:05:29.9516748Z 'waves_per_eu': 2} 2026-02-21T13:05:29.9608364Z [713s] Fitting surrogate: 974 points, 974 targets 2026-02-21T13:05:30.3820961Z [713s] Generation 13 starting: 37 neighbors, 2 active search path(s) 2026-02-21T13:06:06.0610359Z [749s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], 
l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[0, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:06:06.0626587Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.4 configs/s 2026-02-21T13:06:08.6491747Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 14.5 configs/s 2026-02-21T13:06:10.3469480Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 158/158 39.3 configs/s 2026-02-21T13:06:15.2284361Z [758s] Generation 13 complete: 2026-02-21T13:06:15.2284794Z timeout=1 2026-02-21T13:06:15.2285007Z ok=38 2026-02-21T13:06:15.2285748Z min=1.3319 2026-02-21T13:06:15.2285959Z mid=1.4907 2026-02-21T13:06:15.2286169Z max=6.7336 2026-02-21T13:06:15.2286402Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:06:15.2286824Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:06:15.2287240Z 'l2_groupings': [64], 2026-02-21T13:06:15.2287549Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:06:15.2287870Z 'loop_orders': [[0, 1]], 2026-02-21T13:06:15.2288155Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:06:15.2288420Z 'num_stages': 1, 2026-02-21T13:06:15.2288654Z 'num_warps': 4, 2026-02-21T13:06:15.2288890Z 'pid_type': 'flat', 2026-02-21T13:06:15.2289322Z 'range_flattens': [None, True], 2026-02-21T13:06:15.2289628Z 'range_multi_buffers': [None, None], 2026-02-21T13:06:15.2289935Z 'range_num_stages': [0, 0], 2026-02-21T13:06:15.2290239Z 'range_unroll_factors': [0, 2], 2026-02-21T13:06:15.2290532Z 'range_warp_specializes': [], 2026-02-21T13:06:15.2290817Z 'waves_per_eu': 2} 2026-02-21T13:06:15.2352258Z [758s] Fitting surrogate: 1013 points, 1013 targets 2026-02-21T13:06:15.6901462Z [758s] Generation 14 starting: 33 neighbors, 2 active search path(s) 
2026-02-21T13:06:28.2485699Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 1.4 configs/s 2026-02-21T13:06:31.1968084Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 11.7 configs/s 2026-02-21T13:06:31.3073274Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s 2026-02-21T13:06:34.5726699Z [777s] Generation 14 complete: 2026-02-21T13:06:34.5726909Z ok=35 2026-02-21T13:06:34.5727001Z min=1.3077 2026-02-21T13:06:34.5727085Z mid=3.4467 2026-02-21T13:06:34.5727207Z max=15.4661 2026-02-21T13:06:34.5727301Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:06:34.5727461Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:06:34.5727630Z 'l2_groupings': [32], 2026-02-21T13:06:34.5727742Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:06:34.5727866Z 'loop_orders': [[0, 1]], 2026-02-21T13:06:34.5727981Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:06:34.5728088Z 'num_stages': 1, 2026-02-21T13:06:34.5728178Z 'num_warps': 4, 2026-02-21T13:06:34.5728272Z 'pid_type': 'flat', 2026-02-21T13:06:34.5728375Z 'range_flattens': [None, True], 2026-02-21T13:06:34.5728873Z 'range_multi_buffers': [None, None], 2026-02-21T13:06:34.5728992Z 'range_num_stages': [0, 0], 2026-02-21T13:06:34.5729103Z 'range_unroll_factors': [0, 2], 2026-02-21T13:06:34.5729213Z 'range_warp_specializes': [], 2026-02-21T13:06:34.5729321Z 'waves_per_eu': 2} 2026-02-21T13:06:34.5804107Z [777s] Fitting surrogate: 1048 points, 1048 targets 2026-02-21T13:06:34.9855113Z [778s] Generation 15 starting: 30 neighbors, 2 active search path(s) 2026-02-21T13:07:06.0046037Z [809s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], 
range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:07:06.9480921Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0.4 configs/s 2026-02-21T13:07:09.6355275Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 11.7 configs/s 2026-02-21T13:07:09.7011625Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s 2026-02-21T13:07:11.5028984Z [814s] Generation 15 complete: 2026-02-21T13:07:11.5029418Z error=1 2026-02-21T13:07:11.5029634Z timeout=1 2026-02-21T13:07:11.5029839Z ok=30 2026-02-21T13:07:11.5030079Z min=1.2747 2026-02-21T13:07:11.5030292Z mid=4.0630 2026-02-21T13:07:11.5030518Z max=17.3615 2026-02-21T13:07:11.5030782Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:07:11.5031339Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:07:11.5031772Z 'l2_groupings': [32], 2026-02-21T13:07:11.5032450Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:07:11.5035308Z 'loop_orders': [[0, 1]], 2026-02-21T13:07:11.5035705Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:07:11.5035994Z 'num_stages': 1, 2026-02-21T13:07:11.5036240Z 'num_warps': 4, 2026-02-21T13:07:11.5036519Z 'pid_type': 'flat', 2026-02-21T13:07:11.5036787Z 'range_flattens': [None, True], 2026-02-21T13:07:11.5037104Z 'range_multi_buffers': [None, None], 2026-02-21T13:07:11.5037438Z 'range_num_stages': [0, 0], 2026-02-21T13:07:11.5037722Z 'range_unroll_factors': [0, 1], 2026-02-21T13:07:11.5038039Z 'range_warp_specializes': [], 2026-02-21T13:07:11.5038331Z 'waves_per_eu': 2} 2026-02-21T13:07:11.5079093Z [814s] Fitting surrogate: 1080 points, 1080 targets 2026-02-21T13:07:12.1108981Z [815s] Generation 16 starting: 30 neighbors, 2 active search path(s) 2026-02-21T13:07:22.3720672Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0.9 configs/s 2026-02-21T13:07:24.9764846Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 12.5 configs/s 2026-02-21T13:07:25.0700673Z 
Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s 2026-02-21T13:07:27.7744068Z [831s] Generation 16 complete: 2026-02-21T13:07:27.7744480Z error=1 2026-02-21T13:07:27.7744731Z ok=31 2026-02-21T13:07:27.7744930Z min=1.2896 2026-02-21T13:07:27.7745143Z mid=2.8624 2026-02-21T13:07:27.7745345Z max=19.3822 2026-02-21T13:07:27.7745572Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:07:27.7745984Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:07:27.7746377Z 'l2_groupings': [32], 2026-02-21T13:07:27.7746672Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:07:27.7746988Z 'loop_orders': [[0, 1]], 2026-02-21T13:07:27.7747259Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:07:27.7747960Z 'num_stages': 1, 2026-02-21T13:07:27.7748116Z 'num_warps': 4, 2026-02-21T13:07:27.7748271Z 'pid_type': 'flat', 2026-02-21T13:07:27.7748451Z 'range_flattens': [None, True], 2026-02-21T13:07:27.7748667Z 'range_multi_buffers': [None, None], 2026-02-21T13:07:27.7748873Z 'range_num_stages': [0, 0], 2026-02-21T13:07:27.7749058Z 'range_unroll_factors': [0, 1], 2026-02-21T13:07:27.7749255Z 'range_warp_specializes': [], 2026-02-21T13:07:27.7749444Z 'waves_per_eu': 2} 2026-02-21T13:07:27.7805696Z [831s] Fitting surrogate: 1112 points, 1112 targets 2026-02-21T13:07:28.8536313Z [832s] Generation 17 starting: 16 neighbors, 1 active search path(s) 2026-02-21T13:07:40.6941565Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.7 configs/s 2026-02-21T13:07:42.8255716Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 8.3 configs/s 2026-02-21T13:07:42.8667499Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s 2026-02-21T13:07:44.0774465Z [847s] Generation 17 complete: 2026-02-21T13:07:44.0774841Z ok=18 2026-02-21T13:07:44.0775045Z min=1.3154 2026-02-21T13:07:44.0775260Z mid=4.0403 2026-02-21T13:07:44.0775460Z max=33.1777 2026-02-21T13:07:44.0776024Z best={'block_sizes': [1, 128, 128], 
2026-02-21T13:07:44.0776451Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:07:44.0776849Z 'l2_groupings': [8], 2026-02-21T13:07:44.0777119Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:07:44.0777455Z 'loop_orders': [[1, 0]], 2026-02-21T13:07:44.0777726Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:07:44.0777993Z 'num_stages': 4, 2026-02-21T13:07:44.0778220Z 'num_warps': 4, 2026-02-21T13:07:44.0778448Z 'pid_type': 'flat', 2026-02-21T13:07:44.0778709Z 'range_flattens': [None, False], 2026-02-21T13:07:44.0779012Z 'range_multi_buffers': [None, None], 2026-02-21T13:07:44.0779474Z 'range_num_stages': [0, 1], 2026-02-21T13:07:44.0779754Z 'range_unroll_factors': [0, 1], 2026-02-21T13:07:44.0780050Z 'range_warp_specializes': [], 2026-02-21T13:07:44.0780326Z 'waves_per_eu': 2} 2026-02-21T13:07:44.0815638Z [847s] Fitting surrogate: 1130 points, 1130 targets 2026-02-21T13:07:44.3419716Z [847s] Generation 18 starting: 17 neighbors, 1 active search path(s) 2026-02-21T13:07:53.0910465Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.8 configs/s 2026-02-21T13:07:54.9500898Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 9.7 configs/s 2026-02-21T13:07:54.9916257Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s 2026-02-21T13:07:56.1101829Z [859s] Generation 18 complete: 2026-02-21T13:07:56.1102326Z error=1 2026-02-21T13:07:56.1102653Z ok=17 2026-02-21T13:07:56.1102974Z min=1.3981 2026-02-21T13:07:56.1103283Z mid=2.9325 2026-02-21T13:07:56.1103538Z max=28.4636 2026-02-21T13:07:56.1103775Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:07:56.1104218Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'], 2026-02-21T13:07:56.1104627Z 'l2_groupings': [8], 2026-02-21T13:07:56.1104921Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:07:56.1105239Z 'loop_orders': [[1, 0]], 2026-02-21T13:07:56.1105516Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:07:56.1105792Z 
'num_stages': 4, 2026-02-21T13:07:56.1106024Z 'num_warps': 4, 2026-02-21T13:07:56.1106254Z 'pid_type': 'flat', 2026-02-21T13:07:56.1106516Z 'range_flattens': [None, False], 2026-02-21T13:07:56.1106820Z 'range_multi_buffers': [None, None], 2026-02-21T13:07:56.1107408Z 'range_num_stages': [0, 1], 2026-02-21T13:07:56.1107689Z 'range_unroll_factors': [0, 1], 2026-02-21T13:07:56.1107987Z 'range_warp_specializes': [], 2026-02-21T13:07:56.1108262Z 'waves_per_eu': 2} 2026-02-21T13:07:56.1142210Z [859s] Fitting surrogate: 1148 points, 1148 targets 2026-02-21T13:07:56.2419643Z [859s] Autotuning complete in 859.5s after searching 1065 configs. 2026-02-21T13:07:56.2420051Z One can hardcode the best config and skip autotuning with: 2026-02-21T13:07:56.2421058Z @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T13:07:56.2422189Z 2026-02-21T13:07:56.2422435Z [859s] Code of selected kernel: /tmp/torchinductor_root/et/cettuquifdqfk5ruaus5w25bzjtsigxxq5ms2ayqaiw4dzpt7j5j.py 2026-02-21T13:07:56.2646994Z from __future__ import annotations 2026-02-21T13:07:56.2647221Z 2026-02-21T13:07:56.2647295Z import torch 2026-02-21T13:07:56.2647477Z import triton 2026-02-21T13:07:56.2647663Z import triton.language as tl 2026-02-21T13:07:56.2647854Z from torch._inductor.runtime import triton_helpers 2026-02-21T13:07:56.2648099Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T13:07:56.2648367Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T13:07:56.2648529Z 2026-02-21T13:07:56.2648601Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T13:07:56.2648760Z 
_BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T13:07:56.2662149Z _SHAPE_DIM = tl.constexpr(128) 2026-02-21T13:07:56.2662302Z _BLOCK_SIZE_3 = tl.constexpr(128) 2026-02-21T13:07:56.2662445Z _SHAPE_DIM_4 = tl.constexpr(128) 2026-02-21T13:07:56.2662533Z 2026-02-21T13:07:56.2662578Z @triton.jit 2026-02-21T13:07:56.2662762Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr): 2026-02-21T13:07:56.2663048Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T13:07:56.2663267Z num_pid_m = tl.cdiv(2048, _BLOCK_SIZE_1) 2026-02-21T13:07:56.2663412Z num_pid_n = 192 2026-02-21T13:07:56.2663530Z inner_2d_pid = tl.program_id(0) 2026-02-21T13:07:56.2663797Z num_pid_in_group = 8 * num_pid_n 2026-02-21T13:07:56.2663957Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T13:07:56.2664126Z first_pid_m = group_id * 8 2026-02-21T13:07:56.2664283Z group_size_m = min(num_pid_m - first_pid_m, 8) 2026-02-21T13:07:56.2664496Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T13:07:56.2664725Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T13:07:56.2664896Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T13:07:56.2665080Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T13:07:56.2665261Z offset_0 = pid_1 2026-02-21T13:07:56.2665396Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T13:07:56.2665572Z indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32) 2026-02-21T13:07:56.2665822Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T13:07:56.2666102Z m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32) 2026-02-21T13:07:56.2666290Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T13:07:56.2666460Z l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32) 2026-02-21T13:07:56.2666662Z # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], 
dtype=torch.float32) 2026-02-21T13:07:56.2666889Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32) 2026-02-21T13:07:56.2667058Z # src[attention.py:71]: q = q_view[tile_b, tile_m, :] 2026-02-21T13:07:56.2667393Z q = tl.load(tl.make_block_ptr(q_view, [192, 2048, 128], [262144, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T13:07:56.2667771Z # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)): 2026-02-21T13:07:56.2667939Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T13:07:56.2668093Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T13:07:56.2668224Z # src[attention.py:72-85]: ... 2026-02-21T13:07:56.2668415Z for offset_2 in tl.range(0, 2048, _BLOCK_SIZE_3, loop_unroll_factor=1, num_stages=1, flatten=False): 2026-02-21T13:07:56.2668677Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32) 2026-02-21T13:07:56.2668822Z q_copy = q 2026-02-21T13:07:56.2668910Z m_i_copy = m_i 2026-02-21T13:07:56.2669003Z l_i_copy = l_i 2026-02-21T13:07:56.2669095Z acc_copy = acc 2026-02-21T13:07:56.2669187Z q_copy_0 = q_copy 2026-02-21T13:07:56.2669291Z m_i_copy_0 = m_i_copy 2026-02-21T13:07:56.2669390Z l_i_copy_0 = l_i_copy 2026-02-21T13:07:56.2669492Z acc_copy_0 = acc_copy 2026-02-21T13:07:56.2669614Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T13:07:56.2669861Z k = tl.load(k_view + (indices_0[:, None, None] * 262144 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T13:07:56.2670100Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T13:07:56.2670545Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T13:07:56.2671007Z # src[attention.py:75]: 
m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale) 2026-02-21T13:07:56.2671192Z amax = tl.cast(tl.max(qk, 2), tl.bfloat16) 2026-02-21T13:07:56.2671315Z v_0 = 0.12751743074602467 2026-02-21T13:07:56.2671434Z v_1 = tl.cast(amax * v_0, tl.bfloat16) 2026-02-21T13:07:56.2671556Z v_2 = tl.cast(v_1, tl.float32) 2026-02-21T13:07:56.2671688Z v_3 = triton_helpers.maximum(m_i_copy_0, v_2) 2026-02-21T13:07:56.2671846Z # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None] 2026-02-21T13:07:56.2671991Z v_4 = 0.12751743074602467 2026-02-21T13:07:56.2672121Z v_5 = tl.cast(qk * v_4, tl.bfloat16) 2026-02-21T13:07:56.2672241Z subscript = v_3[:, :, None] 2026-02-21T13:07:56.2672356Z v_6 = tl.cast(v_5, tl.float32) 2026-02-21T13:07:56.2672465Z v_7 = v_6 - subscript 2026-02-21T13:07:56.2672582Z # src[attention.py:77]: p = torch.exp2(qk) 2026-02-21T13:07:56.2672707Z v_8 = libdevice.exp2(v_7) 2026-02-21T13:07:56.2672836Z # src[attention.py:78]: l_ij = torch.sum(p, -1) 2026-02-21T13:07:56.2672971Z l_ij = tl.cast(tl.sum(v_8, 2), tl.float32) 2026-02-21T13:07:56.2673117Z # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij) 2026-02-21T13:07:56.2673257Z v_9 = m_i_copy_0 - v_3 2026-02-21T13:07:56.2673363Z v_10 = libdevice.exp2(v_9) 2026-02-21T13:07:56.2673489Z # src[attention.py:80]: l_i = l_i * alpha + l_ij 2026-02-21T13:07:56.2673616Z v_11 = l_i_copy_0 * v_10 2026-02-21T13:07:56.2673735Z l_i = v_11 + l_ij 2026-02-21T13:07:56.2673854Z # src[attention.py:81]: acc = acc * alpha[:, :, None] 2026-02-21T13:07:56.2673991Z subscript_1 = v_10[:, :, None] 2026-02-21T13:07:56.2674109Z v_13 = acc_copy_0 * subscript_1 2026-02-21T13:07:56.2674242Z # src[attention.py:82]: v = v_view[tile_b, tile_n, :] 2026-02-21T13:07:56.2674576Z v = tl.load(tl.make_block_ptr(v_view, [192, 2048, 128], [262144, 128, 1], [offset_0, offset_2, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_3, _SHAPE_DIM_4], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T13:07:56.2674898Z # src[attention.py:83]: p = 
p.to(v.dtype) 2026-02-21T13:07:56.2675041Z v_14 = tl.cast(v_8, tl.bfloat16) 2026-02-21T13:07:56.2675182Z # src[attention.py:84]: acc = torch.baddbmm(acc, p, v) 2026-02-21T13:07:56.2675632Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T13:07:56.2676061Z # src[attention.py:85]: m_i = m_ij 2026-02-21T13:07:56.2676187Z m_i = v_3 2026-02-21T13:07:56.2676295Z # src[attention.py:87]: acc = acc / l_i[:, :, None] 2026-02-21T13:07:56.2676421Z subscript_2 = l_i[:, :, None] 2026-02-21T13:07:56.2676531Z v_15 = acc / subscript_2 2026-02-21T13:07:56.2676672Z # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype) 2026-02-21T13:07:56.2676820Z v_16 = tl.cast(v_15, tl.bfloat16) 2026-02-21T13:07:56.2677069Z tl.store(out + (indices_0[:, None, None] * 262144 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), v_16, None) 2026-02-21T13:07:56.2677251Z 2026-02-21T13:07:56.2677383Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T13:07:56.2677582Z """ 2026-02-21T13:07:56.2677675Z Computes scaled dot-product attention. 
2026-02-21T13:07:56.2677758Z 2026-02-21T13:07:56.2677869Z Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V 2026-02-21T13:07:56.2678022Z 2026-02-21T13:07:56.2678058Z Args: 2026-02-21T13:07:56.2678160Z q_in: Query tensor of shape [..., seq_len_q, head_dim] 2026-02-21T13:07:56.2678332Z k_in: Key tensor of shape [..., seq_len_k, head_dim] 2026-02-21T13:07:56.2678482Z v_in: Value tensor of shape [..., seq_len_k, head_dim] 2026-02-21T13:07:56.2678582Z 2026-02-21T13:07:56.2678615Z Returns: 2026-02-21T13:07:56.2686776Z Output tensor of shape [..., seq_len_q, head_dim] 2026-02-21T13:07:56.2686915Z """ 2026-02-21T13:07:56.2687013Z # src[attention.py:56]: m_dim = q_in.size(-2) 2026-02-21T13:07:56.2687141Z m_dim = q_in.size(-2) 2026-02-21T13:07:56.2687252Z # src[attention.py:57]: n_dim = k_in.size(-2) 2026-02-21T13:07:56.2687370Z n_dim = k_in.size(-2) 2026-02-21T13:07:56.2687489Z # src[attention.py:58]: assert n_dim == v_in.size(-2) 2026-02-21T13:07:56.2687685Z assert n_dim == v_in.size(-2) 2026-02-21T13:07:56.2687831Z # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1)) 2026-02-21T13:07:56.2687978Z head_dim = 128 2026-02-21T13:07:56.2688113Z # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T13:07:56.2688293Z assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T13:07:56.2688458Z # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T13:07:56.2688619Z q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T13:07:56.2688772Z # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T13:07:56.2688932Z v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T13:07:56.2689107Z # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T13:07:56.2689306Z k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T13:07:56.2689474Z # src[attention.py:64]: out = torch.empty_like(q_view) 
2026-02-21T13:07:56.2689609Z out = torch.empty_like(q_view) 2026-02-21T13:07:56.2689771Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T13:07:56.2689931Z _BLOCK_SIZE_1 = 128 2026-02-21T13:07:56.2690026Z _RDIM_SIZE_2 = 128 2026-02-21T13:07:56.2690166Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T13:07:56.2690394Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T13:07:56.2690605Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T13:07:56.2690772Z # src[attention.py:67-88]: ... 2026-02-21T13:07:56.2691078Z _launcher(_helion_attention, (triton.cdiv(2048, _BLOCK_SIZE_1) * 192,), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=4, waves_per_eu=2, matrix_instr_nonkdim=0) 2026-02-21T13:07:56.2691402Z # src[attention.py:89]: return out.view(q_in.size()) 2026-02-21T13:07:56.2691540Z return out.view(q_in.size()) 2026-02-21T13:07:57.1850376Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T13:07:57.1852052Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T13:07:57.1853201Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T13:07:57.1853442Z WARNING:tritonbench.utils.triton_op:Completed input ID 4: 2026-02-21T13:07:57.1853692Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T13:07:57.1853870Z ------------------------------------------ 2026-02-21T13:07:57.1854048Z (4, 48, 2048, 2048, 128) 2026-02-21T13:07:57.1854136Z 2026-02-21T13:07:57.1854693Z 67%|██████▋ | 4/6 [41:03<23:45, 712.60s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T13:07:57.1854969Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T13:07:57.1855254Z ------------------------------------------ 2026-02-21T13:07:57.1855410Z (4, 48, 4096, 4096, 128) 2026-02-21T13:07:57.1855633Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten 2026-02-21T13:07:58.2106229Z INFO:tritonbench.utils.triton_op:Took 1.42ms to get benchmark function for flex_attention 2026-02-21T13:07:59.5452789Z WARNING:__main__:Input tensor metadata: 2026-02-21T13:07:59.5453153Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T13:07:59.5453456Z 'dtype': 'torch.bfloat16', 2026-02-21T13:07:59.5453747Z 'shape': (4, 48, 4096, 128), 2026-02-21T13:07:59.5454054Z 'stride': (25165824, 524288, 128, 1)}, 2026-02-21T13:07:59.5454701Z { 'device': 'cuda:0', 2026-02-21T13:07:59.5454968Z 'dtype': 'torch.bfloat16', 2026-02-21T13:07:59.5455268Z 'shape': (4, 48, 4096, 128), 2026-02-21T13:07:59.5455552Z 'stride': (25165824, 524288, 128, 1)}, 2026-02-21T13:07:59.5455843Z { 'device': 'cuda:0', 2026-02-21T13:07:59.5456107Z 'dtype': 'torch.bfloat16', 2026-02-21T13:07:59.5456387Z 'shape': (4, 48, 4096, 128), 2026-02-21T13:07:59.5456671Z 'stride': (25165824, 524288, 128, 1)}), 2026-02-21T13:07:59.5456960Z 'kwargs': {}} 2026-02-21T13:07:59.5496140Z INFO:tritonbench.utils.triton_op:Took 4.67ms to get benchmark function for helion_attention 2026-02-21T13:07:59.7937663Z [0s] Autotune random seed: 2150287535 2026-02-21T13:07:59.8499544Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, 
max_generations=20, similarity_penalty=1.0 2026-02-21T13:08:34.0939106Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 64, 2048], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:08:34.4189717Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 2, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 2], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:36.9903501Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 2, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:38.0265459Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 32, 2048], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[3, 3], range_unroll_factors=[0, 
2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:38.9411927Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:39.7437267Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:08:41.2780056Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 128, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:08:42.6114285Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 64, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], 
waves_per_eu=4) 2026-02-21T13:08:43.6194781Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 1, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:08:43.9810158Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:44.4700243Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:08:45.0110802Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[1, 0], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:08:45.3342183Z [45s] Timeout after 30s 
compiling Config(block_sizes=[1, 4, 2048], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:08:45.5504829Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:08:46.2413208Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 16, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:46.4628722Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:08:46.9268690Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 512, 1024], indexing=['pointer', 
'pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:49.0999246Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 8, 4096], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:08:49.7181740Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:50.0485245Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:51.1478171Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], 
l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:08:51.7537548Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 4, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[0, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:52.0891504Z [52s] Timeout after 30s compiling Config(block_sizes=[1, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:08:52.0912153Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.2 configs/s 2026-02-21T13:11:50.4219603Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:63:24: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T13:11:50.4221107Z k = tl.load(tl.make_block_ptr(k_view, [192, 128, 4096], [524288, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T13:11:50.4221969Z ^ 2026-02-21T13:11:50.4223653Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:65:145: note: - use: %137 = 
"tt.reshape"(<>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>> 2026-02-21T13:11:50.4225280Z 2026-02-21T13:11:50.4226213Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T13:11:50.4227368Z ^ 2026-02-21T13:11:50.4227730Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T13:11:50.4228196Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}> 2026-02-21T13:11:50.4228889Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}> 2026-02-21T13:11:50.4229467Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}> 2026-02-21T13:11:50.4230049Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}> 2026-02-21T13:11:50.4230726Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [8, 1], order = [1, 0]}> 2026-02-21T13:11:50.4231278Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}> 2026-02-21T13:11:50.4231815Z #blocked6 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}> 2026-02-21T13:11:50.4232350Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}> 2026-02-21T13:11:50.4232878Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], 
threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}> 2026-02-21T13:11:50.4233431Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}> 2026-02-21T13:11:50.4234021Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}> 2026-02-21T13:11:50.4234601Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}> 2026-02-21T13:11:50.4235259Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}> 2026-02-21T13:11:50.4235829Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}> 2026-02-21T13:11:50.4236398Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}> 2026-02-21T13:11:50.4236968Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}> 2026-02-21T13:11:50.4237601Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T13:11:50.4238298Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T13:11:50.4238844Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T13:11:50.4239008Z %c128_i64 = arith.constant 128 : i64 2026-02-21T13:11:50.4239167Z %c192_i64 = arith.constant 192 : i64 2026-02-21T13:11:50.4239321Z %c0_i64 = arith.constant 0 : i64 2026-02-21T13:11:50.4239485Z %c524288_i64 = arith.constant 524288 : i64 2026-02-21T13:11:50.4239647Z %c128_i32 = arith.constant 128 : 
i32 2026-02-21T13:11:50.4239805Z %c524288_i32 = arith.constant 524288 : i32 2026-02-21T13:11:50.4240010Z %cst = arith.constant dense<128> : tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4240248Z %cst_0 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4240517Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<1x128x512xbf16, #blocked1> 2026-02-21T13:11:50.4240785Z %cst_2 = arith.constant dense<4096> : tensor<1x1x512xi64, #blocked1> 2026-02-21T13:11:50.4241028Z %cst_3 = arith.constant dense<0> : tensor<1x1x512xi64, #blocked1> 2026-02-21T13:11:50.4241268Z %cst_4 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2> 2026-02-21T13:11:50.4241508Z %cst_5 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2> 2026-02-21T13:11:50.4241749Z %cst_6 = arith.constant dense<128> : tensor<1x1x512xi64, #blocked1> 2026-02-21T13:11:50.4241972Z %c512_i32 = arith.constant 512 : i32 2026-02-21T13:11:50.4242130Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T13:11:50.4242290Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T13:11:50.4242453Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T13:11:50.4242692Z %c786432_i32 = arith.constant 786432 : i32 2026-02-21T13:11:50.4242897Z %cst_7 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked3> 2026-02-21T13:11:50.4243147Z %cst_8 = arith.constant dense<0.127517432> : tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4243435Z %cst_9 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4243699Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x512xf32, #blocked5> 2026-02-21T13:11:50.4243914Z %c0_i32 = arith.constant 0 : i32 2026-02-21T13:11:50.4244118Z %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked> 2026-02-21T13:11:50.4244379Z %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4244639Z %cst_13 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4> 
2026-02-21T13:11:50.4244850Z %c64_i32 = arith.constant 64 : i32 2026-02-21T13:11:50.4245003Z %c192_i32 = arith.constant 192 : i32 2026-02-21T13:11:50.4245159Z %0 = tt.get_program_id x : i32 2026-02-21T13:11:50.4245369Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked6> 2026-02-21T13:11:50.4245732Z %2 = ttg.convert_layout %1 : tensor<128xi32, #blocked6> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T13:11:50.4246197Z %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7> 2026-02-21T13:11:50.4246585Z %4 = ttg.convert_layout %3 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked8> 2026-02-21T13:11:50.4246964Z %5 = ttg.convert_layout %4 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T13:11:50.4247411Z %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9> 2026-02-21T13:11:50.4247802Z %7 = ttg.convert_layout %6 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked> 2026-02-21T13:11:50.4248066Z %8 = tt.splat %arg0 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T13:11:50.4248376Z %9 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked6> 2026-02-21T13:11:50.4248600Z %10 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x512x!tt.ptr, #blocked1> 2026-02-21T13:11:50.4248824Z %11 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6> 2026-02-21T13:11:50.4249107Z %12 = ttg.convert_layout %11 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T13:11:50.4249456Z %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7> 2026-02-21T13:11:50.4249763Z %14 = ttg.convert_layout %13 : tensor<1x128xi64, #blocked7> -> 
tensor<1x128xi64, #blocked8> 2026-02-21T13:11:50.4250072Z %15 = ttg.convert_layout %14 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> 2026-02-21T13:11:50.4250438Z %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi64, #blocked10> 2026-02-21T13:11:50.4250761Z %17 = ttg.convert_layout %16 : tensor<1x128x1xi64, #blocked10> -> tensor<1x128x1xi64, #blocked2> 2026-02-21T13:11:50.4251032Z %18 = tt.broadcast %17 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x512xi64, #blocked2> 2026-02-21T13:11:50.4251301Z %19 = ttg.convert_layout %18 : tensor<1x128x512xi64, #blocked2> -> tensor<1x128x512xi64, #blocked1> 2026-02-21T13:11:50.4251554Z %20 = arith.extsi %9 : tensor<512xi32, #blocked6> to tensor<512xi64, #blocked6> 2026-02-21T13:11:50.4251778Z %21 = arith.cmpi sge, %17, %cst_5 : tensor<1x128x1xi64, #blocked2> 2026-02-21T13:11:50.4251963Z %22 = arith.cmpi slt, %17, %cst_4 : tensor<1x128x1xi64, #blocked2> 2026-02-21T13:11:50.4252139Z %23 = arith.andi %21, %22 : tensor<1x128x1xi1, #blocked2> 2026-02-21T13:11:50.4252345Z %24 = tt.broadcast %7 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked> 2026-02-21T13:11:50.4252610Z %25 = ttg.convert_layout %24 : tensor<1x512x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked11> 2026-02-21T13:11:50.4252885Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<1x512x128x!tt.ptr, #blocked11> 2026-02-21T13:11:50.4253108Z %27 = tt.splat %arg3 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T13:11:50.4253324Z %28 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6> 2026-02-21T13:11:50.4253605Z %29 = ttg.convert_layout %28 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T13:11:50.4253950Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7> 
2026-02-21T13:11:50.4254254Z %31 = ttg.convert_layout %30 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8> 2026-02-21T13:11:50.4254560Z %32 = ttg.convert_layout %31 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> 2026-02-21T13:11:50.4254921Z %33 = tt.expand_dims %32 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9> 2026-02-21T13:11:50.4255257Z %34 = ttg.convert_layout %33 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4255485Z %35 = arith.cmpi sge, %34, %cst_0 : tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4255664Z %36 = arith.cmpi slt, %34, %cst : tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4255838Z %37 = arith.andi %35, %36 : tensor<1x1x128xi1, #blocked> 2026-02-21T13:11:50.4256008Z scf.for %arg4 = %0 to %c786432_i32 step %c2432_i32 : i32 { 2026-02-21T13:11:50.4256185Z %38 = arith.divsi %arg4, %c262144_i32 : i32 2026-02-21T13:11:50.4256319Z %39 = arith.muli %38, %c64_i32 : i32 2026-02-21T13:11:50.4256446Z %40 = arith.subi %c192_i32, %39 : i32 2026-02-21T13:11:50.4256603Z %41 = arith.minsi %40, %c64_i32 : i32 2026-02-21T13:11:50.4256734Z %42 = arith.remsi %arg4, %c262144_i32 : i32 2026-02-21T13:11:50.4256866Z %43 = arith.remsi %42, %41 : i32 2026-02-21T13:11:50.4256987Z %44 = arith.addi %39, %43 : i32 2026-02-21T13:11:50.4257105Z %45 = arith.divsi %42, %41 : i32 2026-02-21T13:11:50.4257225Z %46 = arith.muli %44, %c524288_i32 : i32 2026-02-21T13:11:50.4257346Z %47 = arith.muli %45, %c128_i32 : i32 2026-02-21T13:11:50.4257461Z %48 = arith.addi %46, %47 : i32 2026-02-21T13:11:50.4257594Z %49 = tt.splat %48 : i32 -> tensor<1x1x128xi32, #blocked> 2026-02-21T13:11:50.4257748Z %50 = arith.addi %49, %7 : tensor<1x1x128xi32, #blocked> 2026-02-21T13:11:50.4257945Z %51 = tt.addptr %8, %50 : tensor<1x1x128x!tt.ptr, #blocked>, tensor<1x1x128xi32, #blocked> 2026-02-21T13:11:50.4258149Z %52 = tt.load %51 : 
tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T13:11:50.4258288Z %53 = arith.extsi %44 : i32 to i64 2026-02-21T13:11:50.4258409Z %54 = arith.muli %53, %c524288_i64 : i64 2026-02-21T13:11:50.4258545Z %55 = tt.splat %54 : i64 -> tensor<1x128x512xi64, #blocked1> 2026-02-21T13:11:50.4258689Z %56 = arith.cmpi sge, %53, %c0_i64 : i64 2026-02-21T13:11:50.4258810Z %57 = arith.cmpi slt, %53, %c192_i64 : i64 2026-02-21T13:11:50.4258933Z %58 = arith.andi %56, %57 : i1 2026-02-21T13:11:50.4259063Z %59 = tt.splat %58 : i1 -> tensor<1x128x1xi1, #blocked2> 2026-02-21T13:11:50.4259218Z %60 = arith.andi %59, %23 : tensor<1x128x1xi1, #blocked2> 2026-02-21T13:11:50.4259415Z %61 = tt.broadcast %60 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x512xi1, #blocked2> 2026-02-21T13:11:50.4259678Z %62 = ttg.convert_layout %61 : tensor<1x128x512xi1, #blocked2> -> tensor<1x128x512xi1, #blocked1> 2026-02-21T13:11:50.4259920Z %63 = tt.reshape %52 : tensor<1x1x128xbf16, #blocked> -> tensor<1x128xbf16, #blocked8> 2026-02-21T13:11:50.4260113Z %64 = tt.splat %46 : i32 -> tensor<1x512x1xi32, #blocked3> 2026-02-21T13:11:50.4260474Z %65:3 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg6 = %cst_13, %arg7 = %cst_12, %arg8 = %cst_11) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked>) : i32 { 2026-02-21T13:11:50.4260859Z %91 = tt.splat %arg5 : i32 -> tensor<512xi32, #blocked6> 2026-02-21T13:11:50.4261013Z %92 = arith.addi %91, %9 : tensor<512xi32, #blocked6> 2026-02-21T13:11:50.4261148Z %93 = arith.extsi %arg5 : i32 to i64 2026-02-21T13:11:50.4261280Z %94 = tt.splat %93 : i64 -> tensor<512xi64, #blocked6> 2026-02-21T13:11:50.4261430Z %95 = arith.addi %94, %20 : tensor<512xi64, #blocked6> 2026-02-21T13:11:50.4261662Z %96 = ttg.convert_layout %95 : tensor<512xi64, #blocked6> -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T13:11:50.4261990Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, 
parent = #blocked7}>> -> tensor<1x512xi64, #blocked7> 2026-02-21T13:11:50.4262282Z %98 = ttg.convert_layout %97 : tensor<1x512xi64, #blocked7> -> tensor<1x512xi64, #blocked5> 2026-02-21T13:11:50.4262573Z %99 = ttg.convert_layout %98 : tensor<1x512xi64, #blocked5> -> tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> 2026-02-21T13:11:50.4262933Z %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x512xi64, #blocked12> 2026-02-21T13:11:50.4263248Z %101 = ttg.convert_layout %100 : tensor<1x1x512xi64, #blocked12> -> tensor<1x1x512xi64, #blocked1> 2026-02-21T13:11:50.4263468Z %102 = arith.muli %101, %cst_6 : tensor<1x1x512xi64, #blocked1> 2026-02-21T13:11:50.4263676Z %103 = tt.broadcast %102 : tensor<1x1x512xi64, #blocked1> -> tensor<1x128x512xi64, #blocked1> 2026-02-21T13:11:50.4263886Z %104 = arith.addi %19, %103 : tensor<1x128x512xi64, #blocked1> 2026-02-21T13:11:50.4264067Z %105 = arith.addi %55, %104 : tensor<1x128x512xi64, #blocked1> 2026-02-21T13:11:50.4264286Z %106 = tt.addptr %10, %105 : tensor<1x128x512x!tt.ptr, #blocked1>, tensor<1x128x512xi64, #blocked1> 2026-02-21T13:11:50.4264512Z %107 = arith.cmpi sge, %101, %cst_3 : tensor<1x1x512xi64, #blocked1> 2026-02-21T13:11:50.4264694Z %108 = arith.cmpi slt, %101, %cst_2 : tensor<1x1x512xi64, #blocked1> 2026-02-21T13:11:50.4264867Z %109 = arith.andi %107, %108 : tensor<1x1x512xi1, #blocked1> 2026-02-21T13:11:50.4265069Z %110 = tt.broadcast %109 : tensor<1x1x512xi1, #blocked1> -> tensor<1x128x512xi1, #blocked1> 2026-02-21T13:11:50.4265276Z %111 = arith.andi %62, %110 : tensor<1x128x512xi1, #blocked1> 2026-02-21T13:11:50.4265456Z %112 = tt.load %106, %111, %cst_1 : tensor<1x128x512x!tt.ptr, #blocked1> 2026-02-21T13:11:50.4265687Z %113 = tt.reshape %112 : tensor<1x128x512xbf16, #blocked1> -> tensor<128x512xbf16, #blocked5> 2026-02-21T13:11:50.4265991Z %114 = ttg.convert_layout %63 : tensor<1x128xbf16, #blocked8> -> tensor<1x128xbf16, 
#ttg.dot_op<{opIdx = 0, parent = #blocked5}>> 2026-02-21T13:11:50.4266350Z %115 = ttg.convert_layout %113 : tensor<128x512xbf16, #blocked5> -> tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> 2026-02-21T13:11:50.4266658Z %116 = ttg.convert_layout %cst_10 : tensor<1x512xf32, #blocked5> -> tensor<1x512xf32, #blocked5> 2026-02-21T13:11:50.4267068Z %117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x512xf32, #blocked5> 2026-02-21T13:11:50.4267479Z %118 = tt.reshape %117 : tensor<1x512xf32, #blocked5> -> tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4267719Z %119 = arith.truncf %118 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1> 2026-02-21T13:11:50.4267964Z %120 = arith.extf %119 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4268155Z %121 = "tt.reduce"(%120) <{axis = 2 : i32}> ({ 2026-02-21T13:11:50.4268296Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:11:50.4268418Z %176 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:11:50.4268543Z tt.reduce.return %176 : f32 2026-02-21T13:11:50.4268734Z }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:11:50.4269024Z %122 = ttg.convert_layout %121 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4269297Z %123 = arith.truncf %122 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T13:11:50.4269518Z %124 = arith.extf %123 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4269709Z %125 = arith.mulf %124, %cst_9 : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4269899Z %126 = arith.truncf %125 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4> 2026-02-21T13:11:50.4270115Z %127 = arith.extf %126 : tensor<1x1xbf16, #blocked4> 
to tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4270311Z %128 = arith.cmpf ogt, %arg6, %127 : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4270498Z %129 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4270660Z %130 = arith.ori %128, %129 : tensor<1x1xi1, #blocked4> 2026-02-21T13:11:50.4270852Z %131 = arith.select %130, %arg6, %127 : tensor<1x1xi1, #blocked4>, tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4271056Z %132 = arith.mulf %120, %cst_8 : tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4271261Z %133 = arith.truncf %132 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1> 2026-02-21T13:11:50.4271549Z %134 = ttg.convert_layout %131 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> 2026-02-21T13:11:50.4271906Z %135 = tt.expand_dims %134 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13> 2026-02-21T13:11:50.4272210Z %136 = ttg.convert_layout %135 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T13:11:50.4272458Z %137 = arith.extf %133 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4272699Z %138 = tt.broadcast %136 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x512xf32, #blocked14> 2026-02-21T13:11:50.4272954Z %139 = ttg.convert_layout %138 : tensor<1x1x512xf32, #blocked14> -> tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4273173Z %140 = arith.subf %137, %139 : tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4273483Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1x512xf32, #blocked1> 2026-02-21T13:11:50.4273778Z %142 = "tt.reduce"(%141) <{axis = 2 : i32}> ({ 2026-02-21T13:11:50.4273905Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:11:50.4274025Z %176 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:11:50.4274143Z tt.reduce.return 
%176 : f32 2026-02-21T13:11:50.4274331Z }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:11:50.4274618Z %143 = ttg.convert_layout %142 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4274861Z %144 = arith.subf %arg6, %131 : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4275187Z %145 = tt.extern_elementwise %144 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4275475Z %146 = arith.mulf %arg7, %145 : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4275638Z %147 = arith.addf %146, %143 : tensor<1x1xf32, #blocked4> 2026-02-21T13:11:50.4275878Z %148 = ttg.convert_layout %145 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> 2026-02-21T13:11:50.4276235Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13> 2026-02-21T13:11:50.4276535Z %150 = ttg.convert_layout %149 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T13:11:50.4276782Z %151 = tt.broadcast %150 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14> 2026-02-21T13:11:50.4277035Z %152 = ttg.convert_layout %151 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked> 2026-02-21T13:11:50.4277245Z %153 = arith.mulf %arg8, %152 : tensor<1x1x128xf32, #blocked> 2026-02-21T13:11:50.4277486Z %154 = ttg.convert_layout %92 : tensor<512xi32, #blocked6> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> 2026-02-21T13:11:50.4277816Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x512xi32, #blocked7> 2026-02-21T13:11:50.4278106Z %156 = ttg.convert_layout %155 : tensor<1x512xi32, #blocked7> -> tensor<1x512xi32, #blocked5> 2026-02-21T13:11:50.4278414Z 
%157 = ttg.convert_layout %156 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> 2026-02-21T13:11:50.4278759Z %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x512x1xi32, #blocked15> 2026-02-21T13:11:50.4279071Z %159 = ttg.convert_layout %158 : tensor<1x512x1xi32, #blocked15> -> tensor<1x512x1xi32, #blocked3> 2026-02-21T13:11:50.4279289Z %160 = arith.muli %159, %cst_7 : tensor<1x512x1xi32, #blocked3> 2026-02-21T13:11:50.4279453Z %161 = arith.addi %64, %160 : tensor<1x512x1xi32, #blocked3> 2026-02-21T13:11:50.4279671Z %162 = tt.broadcast %161 : tensor<1x512x1xi32, #blocked3> -> tensor<1x512x128xi32, #blocked3> 2026-02-21T13:11:50.4279930Z %163 = ttg.convert_layout %162 : tensor<1x512x128xi32, #blocked3> -> tensor<1x512x128xi32, #blocked11> 2026-02-21T13:11:50.4280153Z %164 = arith.addi %163, %25 : tensor<1x512x128xi32, #blocked11> 2026-02-21T13:11:50.4280377Z %165 = tt.addptr %26, %164 : tensor<1x512x128x!tt.ptr, #blocked11>, tensor<1x512x128xi32, #blocked11> 2026-02-21T13:11:50.4280603Z %166 = tt.load %165 : tensor<1x512x128x!tt.ptr, #blocked11> 2026-02-21T13:11:50.4280813Z %167 = arith.truncf %141 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1> 2026-02-21T13:11:50.4281051Z %168 = tt.reshape %153 : tensor<1x1x128xf32, #blocked> -> tensor<1x128xf32, #blocked8> 2026-02-21T13:11:50.4281285Z %169 = tt.reshape %167 : tensor<1x1x512xbf16, #blocked1> -> tensor<1x512xbf16, #blocked5> 2026-02-21T13:11:50.4281529Z %170 = tt.reshape %166 : tensor<1x512x128xbf16, #blocked11> -> tensor<512x128xbf16, #blocked8> 2026-02-21T13:11:50.4281831Z %171 = ttg.convert_layout %169 : tensor<1x512xbf16, #blocked5> -> tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:11:50.4282196Z %172 = ttg.convert_layout %170 : tensor<512x128xbf16, #blocked8> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 
2026-02-21T13:11:50.4282497Z %173 = ttg.convert_layout %168 : tensor<1x128xf32, #blocked8> -> tensor<1x128xf32, #blocked8> 2026-02-21T13:11:50.4282937Z %174 = tt.dot %171, %172, %173, inputPrecision = tf32 : tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x128xf32, #blocked8> 2026-02-21T13:11:50.4283351Z %175 = tt.reshape %174 : tensor<1x128xf32, #blocked8> -> tensor<1x1x128xf32, #blocked> 2026-02-21T13:11:50.4283618Z scf.yield %131, %147, %175 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked> 2026-02-21T13:11:50.4283835Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T13:11:50.4284082Z %66 = ttg.convert_layout %65#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> 2026-02-21T13:11:50.4284416Z %67 = tt.expand_dims %66 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13> 2026-02-21T13:11:50.4284710Z %68 = ttg.convert_layout %67 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14> 2026-02-21T13:11:50.4284951Z %69 = tt.broadcast %68 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14> 2026-02-21T13:11:50.4285192Z %70 = ttg.convert_layout %69 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked> 2026-02-21T13:11:50.4285401Z %71 = arith.divf %65#2, %70 : tensor<1x1x128xf32, #blocked> 2026-02-21T13:11:50.4285594Z %72 = arith.truncf %71 : tensor<1x1x128xf32, #blocked> to tensor<1x1x128xbf16, #blocked> 2026-02-21T13:11:50.4285773Z %73 = arith.extsi %44 : i32 to i64 2026-02-21T13:11:50.4285887Z %74 = arith.extsi %45 : i32 to i64 2026-02-21T13:11:50.4286011Z %75 = arith.muli %73, %c524288_i64 : i64 2026-02-21T13:11:50.4286152Z %76 = tt.splat %75 : i64 -> tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4286312Z %77 = arith.muli %74, %c128_i64 : i64 2026-02-21T13:11:50.4286447Z %78 = tt.splat %77 : 
i64 -> tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4286604Z %79 = arith.addi %78, %34 : tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4286761Z %80 = arith.addi %76, %79 : tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4286961Z %81 = tt.addptr %27, %80 : tensor<1x1x128x!tt.ptr, #blocked>, tensor<1x1x128xi64, #blocked> 2026-02-21T13:11:50.4287159Z %82 = arith.cmpi sge, %73, %c0_i64 : i64 2026-02-21T13:11:50.4287285Z %83 = arith.cmpi slt, %73, %c192_i64 : i64 2026-02-21T13:11:50.4303275Z %84 = arith.andi %82, %83 : i1 2026-02-21T13:11:50.4303410Z %85 = arith.cmpi sge, %74, %c0_i64 : i64 2026-02-21T13:11:50.4303541Z %86 = arith.cmpi slt, %74, %c4096_i64 : i64 2026-02-21T13:11:50.4303672Z %87 = arith.andi %85, %86 : i1 2026-02-21T13:11:50.4303783Z %88 = arith.andi %84, %87 : i1 2026-02-21T13:11:50.4303922Z %89 = tt.splat %88 : i1 -> tensor<1x1x128xi1, #blocked> 2026-02-21T13:11:50.4304079Z %90 = arith.andi %89, %37 : tensor<1x1x128xi1, #blocked> 2026-02-21T13:11:50.4304247Z tt.store %81, %72, %90 : tensor<1x1x128x!tt.ptr, #blocked> 2026-02-21T13:11:50.4304423Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32} 2026-02-21T13:11:50.4304566Z tt.return 2026-02-21T13:11:50.4304654Z } 2026-02-21T13:11:50.4304732Z } 2026-02-21T13:11:50.4304776Z 2026-02-21T13:11:50.4304813Z {-# 2026-02-21T13:11:50.4304898Z external_resources: { 2026-02-21T13:11:50.4305003Z mlir_reproducer: { 2026-02-21T13:11:50.4307154Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=1 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T13:11:50.4309445Z disable_threading: false, 2026-02-21T13:11:50.4309559Z verify_each: true 2026-02-21T13:11:50.4309652Z } 2026-02-21T13:11:50.4309731Z } 2026-02-21T13:11:50.4309804Z #-} 2026-02-21T13:11:50.4310085Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T13:11:50.4310801Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T13:11:50.4311359Z [230s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T13:11:50.4312184Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T13:11:50.4312916Z Error: RuntimeError: PassManager::run failed 2026-02-21T13:11:50.4313085Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T13:12:02.0753735Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T13:12:02.0762427Z [242s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s]) 2026-02-21T13:12:02.1258382Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 - configs/s 2026-02-21T13:12:03.2288375Z [243s] Initial random population of 100, 5 starting points: 2026-02-21T13:12:03.2288874Z error=21 2026-02-21T13:12:03.2289081Z timeout=23 2026-02-21T13:12:03.2289283Z ok=56 2026-02-21T13:12:03.2289467Z min=7.3252 2026-02-21T13:12:03.2289662Z mid=46.2729 2026-02-21T13:12:03.2289873Z max=8495.9043 2026-02-21T13:12:03.2290112Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:12:03.2290518Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T13:12:03.2290901Z 'l2_groupings': [2], 2026-02-21T13:12:03.2291180Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:12:03.2291486Z 'loop_orders': [[0, 1]], 2026-02-21T13:12:03.2291753Z 'matrix_instr_nonkdim': 16, 2026-02-21T13:12:03.2292009Z 'num_stages': 1, 2026-02-21T13:12:03.2292239Z 'num_warps': 4, 2026-02-21T13:12:03.2292462Z 'pid_type': 'flat', 2026-02-21T13:12:03.2292717Z 'range_flattens': [None, False], 2026-02-21T13:12:03.2293022Z 'range_multi_buffers': [None, 
False], 2026-02-21T13:12:03.2293325Z 'range_num_stages': [0, 1], 2026-02-21T13:12:03.2293595Z 'range_unroll_factors': [0, 0], 2026-02-21T13:12:03.2293892Z 'range_warp_specializes': [], 2026-02-21T13:12:03.2294160Z 'waves_per_eu': 1} 2026-02-21T13:12:03.2303622Z [243s] Fitting surrogate: 100 points, 100 targets 2026-02-21T13:12:04.1775317Z [244s] Generation 1 starting: 96 neighbors, 5 active search path(s) 2026-02-21T13:12:28.8260207Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 1.5 configs/s 2026-02-21T13:12:47.1097083Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 98/98 5.4 configs/s 2026-02-21T13:12:47.4716855Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 33/33 - configs/s 2026-02-21T13:12:57.6373515Z [297s] Generation 1 complete: 2026-02-21T13:12:57.6373936Z ok=101 2026-02-21T13:12:57.6374154Z min=6.0520 2026-02-21T13:12:57.6374369Z mid=12.1041 2026-02-21T13:12:57.6374590Z max=68.3722 2026-02-21T13:12:57.6374873Z best={'block_sizes': [1, 128, 32], 2026-02-21T13:12:57.6375294Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T13:12:57.6376242Z 'l2_groupings': [16], 2026-02-21T13:12:57.6376522Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:12:57.6376847Z 'loop_orders': [[0, 1]], 2026-02-21T13:12:57.6377132Z 'matrix_instr_nonkdim': 16, 2026-02-21T13:12:57.6377421Z 'num_stages': 1, 2026-02-21T13:12:57.6377655Z 'num_warps': 4, 2026-02-21T13:12:57.6377895Z 'pid_type': 'flat', 2026-02-21T13:12:57.6378155Z 'range_flattens': [None, None], 2026-02-21T13:12:57.6378468Z 'range_multi_buffers': [None, False], 2026-02-21T13:12:57.6378830Z 'range_num_stages': [0, 4], 2026-02-21T13:12:57.6379124Z 'range_unroll_factors': [0, 4], 2026-02-21T13:12:57.6379388Z 'range_warp_specializes': [], 2026-02-21T13:12:57.6379629Z 'waves_per_eu': 2} 2026-02-21T13:12:57.6414099Z [297s] Fitting surrogate: 201 points, 201 targets 2026-02-21T13:12:58.5833046Z [298s] Generation 2 starting: 88 neighbors, 5 active 
search path(s) 2026-02-21T13:13:36.1010180Z [336s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:13:49.9628305Z [350s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:13:49.9653660Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.4 configs/s 2026-02-21T13:14:05.7976919Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 90/90 5.7 configs/s 2026-02-21T13:14:06.2006992Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 33/33 - configs/s 2026-02-21T13:14:17.4107896Z [377s] Generation 2 complete: 2026-02-21T13:14:17.4108307Z error=9 2026-02-21T13:14:17.4108500Z timeout=2 2026-02-21T13:14:17.4108688Z ok=82 2026-02-21T13:14:17.4108862Z min=5.8575 2026-02-21T13:14:17.4109053Z mid=8.0320 2026-02-21T13:14:17.4109231Z max=103.1657 2026-02-21T13:14:17.4109471Z best={'block_sizes': [1, 256, 64], 2026-02-21T13:14:17.4109850Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T13:14:17.4110230Z 'l2_groupings': [2], 2026-02-21T13:14:17.4110491Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:14:17.4110786Z 'loop_orders': [[0, 1]], 2026-02-21T13:14:17.4111047Z 
'matrix_instr_nonkdim': 16, 2026-02-21T13:14:17.4111303Z 'num_stages': 1, 2026-02-21T13:14:17.4111516Z 'num_warps': 8, 2026-02-21T13:14:17.4111728Z 'pid_type': 'flat', 2026-02-21T13:14:17.4111990Z 'range_flattens': [None, None], 2026-02-21T13:14:17.4112273Z 'range_multi_buffers': [None, True], 2026-02-21T13:14:17.4112551Z 'range_num_stages': [0, 1], 2026-02-21T13:14:17.4112806Z 'range_unroll_factors': [0, 0], 2026-02-21T13:14:17.4113082Z 'range_warp_specializes': [], 2026-02-21T13:14:17.4113333Z 'waves_per_eu': 1} 2026-02-21T13:14:17.4148052Z [377s] Fitting surrogate: 294 points, 294 targets 2026-02-21T13:14:19.3291138Z [379s] Generation 3 starting: 86 neighbors, 5 active search path(s) 2026-02-21T13:14:53.6258322Z [413s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:14:53.8478032Z [413s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:14:57.5060542Z [417s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, 
pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:14:57.5083028Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.8 configs/s 2026-02-21T13:15:14.1933113Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 87/87 5.2 configs/s 2026-02-21T13:15:14.5767410Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s 2026-02-21T13:15:25.4174523Z [445s] Generation 3 complete: 2026-02-21T13:15:25.4174894Z error=5 2026-02-21T13:15:25.4175072Z timeout=3 2026-02-21T13:15:25.4175246Z ok=83 2026-02-21T13:15:25.4175424Z min=5.9373 2026-02-21T13:15:25.4175638Z mid=7.1941 2026-02-21T13:15:25.4175805Z max=123.1062 2026-02-21T13:15:25.4176016Z best={'block_sizes': [1, 256, 32], 2026-02-21T13:15:25.4176369Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:15:25.4176713Z 'l2_groupings': [1], 2026-02-21T13:15:25.4176963Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:15:25.4177236Z 'loop_orders': [[1, 0]], 2026-02-21T13:15:25.4177862Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:15:25.4178094Z 'num_stages': 2, 2026-02-21T13:15:25.4178315Z 'num_warps': 8, 2026-02-21T13:15:25.4178514Z 'pid_type': 'flat', 2026-02-21T13:15:25.4178743Z 'range_flattens': [None, False], 2026-02-21T13:15:25.4179005Z 'range_multi_buffers': [None, False], 2026-02-21T13:15:25.4179280Z 'range_num_stages': [0, 2], 2026-02-21T13:15:25.4179519Z 'range_unroll_factors': [0, 2], 2026-02-21T13:15:25.4179775Z 'range_warp_specializes': [], 2026-02-21T13:15:25.4180005Z 'waves_per_eu': 2} 2026-02-21T13:15:25.4216402Z [445s] Fitting surrogate: 385 points, 385 targets 2026-02-21T13:15:26.3676785Z [446s] Generation 4 starting: 85 neighbors, 5 active search path(s) 2026-02-21T13:16:03.6646411Z [483s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 
'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:16:08.3220036Z [488s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:16:08.3245874Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.6 configs/s 2026-02-21T13:16:19.3405510Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 87/87 8.0 configs/s 2026-02-21T13:16:19.7435410Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 37/37 - configs/s 2026-02-21T13:16:31.9892110Z [512s] Generation 4 complete: 2026-02-21T13:16:31.9892349Z error=11 2026-02-21T13:16:31.9892445Z timeout=2 2026-02-21T13:16:31.9892828Z ok=78 2026-02-21T13:16:31.9892938Z min=5.3599 2026-02-21T13:16:31.9893027Z mid=7.0520 2026-02-21T13:16:31.9893118Z max=56.8961 2026-02-21T13:16:31.9893227Z best={'block_sizes': [1, 128, 32], 2026-02-21T13:16:31.9894655Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T13:16:31.9894939Z 'l2_groupings': [8], 2026-02-21T13:16:31.9895131Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:16:31.9895274Z 'loop_orders': [[0, 1]], 2026-02-21T13:16:31.9895403Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:16:31.9895520Z 'num_stages': 1, 2026-02-21T13:16:31.9895622Z 'num_warps': 4, 2026-02-21T13:16:31.9895726Z 'pid_type': 'flat', 
2026-02-21T13:16:31.9895866Z 'range_flattens': [None, True], 2026-02-21T13:16:31.9896083Z 'range_multi_buffers': [None, False], 2026-02-21T13:16:31.9896305Z 'range_num_stages': [0, 3], 2026-02-21T13:16:31.9896505Z 'range_unroll_factors': [0, 4], 2026-02-21T13:16:31.9896714Z 'range_warp_specializes': [], 2026-02-21T13:16:31.9896877Z 'waves_per_eu': 2} 2026-02-21T13:16:31.9930691Z [512s] Fitting surrogate: 476 points, 476 targets 2026-02-21T13:16:32.8648038Z [513s] Generation 5 starting: 75 neighbors, 5 active search path(s) 2026-02-21T13:17:11.6563122Z [551s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:17:11.6582829Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.4 configs/s 2026-02-21T13:17:21.0513775Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 76/76 8.1 configs/s 2026-02-21T13:17:21.4037494Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 38/38 - configs/s 2026-02-21T13:17:32.1835423Z [572s] Generation 5 complete: 2026-02-21T13:17:32.1835724Z error=8 2026-02-21T13:17:32.1835863Z timeout=1 2026-02-21T13:17:32.1835985Z ok=71 2026-02-21T13:17:32.1836133Z min=5.2410 2026-02-21T13:17:32.1836303Z mid=6.8149 2026-02-21T13:17:32.1836436Z max=64.5690 2026-02-21T13:17:32.1836581Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:17:32.1836837Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T13:17:32.1837094Z 'l2_groupings': [8], 2026-02-21T13:17:32.1837284Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:17:32.1837475Z 'loop_orders': [[0, 1]], 
2026-02-21T13:17:32.1837649Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:17:32.1837815Z 'num_stages': 1, 2026-02-21T13:17:32.1837958Z 'num_warps': 4, 2026-02-21T13:17:32.1838097Z 'pid_type': 'flat', 2026-02-21T13:17:32.1838263Z 'range_flattens': [None, True], 2026-02-21T13:17:32.1838463Z 'range_multi_buffers': [None, False], 2026-02-21T13:17:32.1838652Z 'range_num_stages': [0, 3], 2026-02-21T13:17:32.1839226Z 'range_unroll_factors': [0, 4], 2026-02-21T13:17:32.1839410Z 'range_warp_specializes': [], 2026-02-21T13:17:32.1839574Z 'waves_per_eu': 2} 2026-02-21T13:17:32.1886924Z [572s] Fitting surrogate: 556 points, 556 targets 2026-02-21T13:17:32.8577626Z [573s] Generation 6 starting: 62 neighbors, 4 active search path(s) 2026-02-21T13:18:07.8462413Z [607s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:18:07.8478075Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 0.4 configs/s 2026-02-21T13:18:16.6066976Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 63/63 7.2 configs/s 2026-02-21T13:18:16.8540400Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 38/38 - configs/s 2026-02-21T13:18:24.3442272Z [624s] Generation 6 complete: 2026-02-21T13:18:24.3442674Z error=7 2026-02-21T13:18:24.3443277Z timeout=1 2026-02-21T13:18:24.3443419Z ok=58 2026-02-21T13:18:24.3443559Z min=5.2401 2026-02-21T13:18:24.3443700Z mid=6.9285 2026-02-21T13:18:24.3443843Z max=93.9612 2026-02-21T13:18:24.3444010Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:18:24.3444322Z 'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'], 2026-02-21T13:18:24.3444611Z 
'l2_groupings': [8], 2026-02-21T13:18:24.3444807Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:18:24.3445032Z 'loop_orders': [[0, 1]], 2026-02-21T13:18:24.3445230Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:18:24.3445417Z 'num_stages': 1, 2026-02-21T13:18:24.3445581Z 'num_warps': 4, 2026-02-21T13:18:24.3445874Z 'pid_type': 'flat', 2026-02-21T13:18:24.3446060Z 'range_flattens': [None, True], 2026-02-21T13:18:24.3446300Z 'range_multi_buffers': [None, True], 2026-02-21T13:18:24.3446518Z 'range_num_stages': [0, 3], 2026-02-21T13:18:24.3446718Z 'range_unroll_factors': [0, 4], 2026-02-21T13:18:24.3446927Z 'range_warp_specializes': [], 2026-02-21T13:18:24.3447131Z 'waves_per_eu': 2} 2026-02-21T13:18:24.3478797Z [624s] Fitting surrogate: 622 points, 622 targets 2026-02-21T13:18:25.0433786Z [625s] Generation 7 starting: 62 neighbors, 4 active search path(s) 2026-02-21T13:18:48.9709852Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 0.8 configs/s 2026-02-21T13:18:57.6663068Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 63/63 7.3 configs/s 2026-02-21T13:18:57.9267338Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 39/39 - configs/s 2026-02-21T13:19:05.9791778Z [666s] Generation 7 complete: 2026-02-21T13:19:05.9791999Z error=1 2026-02-21T13:19:05.9792119Z ok=65 2026-02-21T13:19:05.9792200Z min=5.2029 2026-02-21T13:19:05.9792281Z mid=7.0048 2026-02-21T13:19:05.9792356Z max=29.7676 2026-02-21T13:19:05.9792465Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:19:05.9792621Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:19:05.9792780Z 'l2_groupings': [8], 2026-02-21T13:19:05.9792892Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:19:05.9793007Z 'loop_orders': [[0, 1]], 2026-02-21T13:19:05.9793110Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:19:05.9793208Z 'num_stages': 1, 2026-02-21T13:19:05.9793294Z 'num_warps': 4, 2026-02-21T13:19:05.9793724Z 'pid_type': 'flat', 
2026-02-21T13:19:05.9793829Z 'range_flattens': [None, True], 2026-02-21T13:19:05.9793940Z 'range_multi_buffers': [None, False], 2026-02-21T13:19:05.9794055Z 'range_num_stages': [0, 1], 2026-02-21T13:19:05.9794155Z 'range_unroll_factors': [0, 1], 2026-02-21T13:19:05.9794263Z 'range_warp_specializes': [], 2026-02-21T13:19:05.9794364Z 'waves_per_eu': 2} 2026-02-21T13:19:05.9830254Z [666s] Fitting surrogate: 688 points, 688 targets 2026-02-21T13:19:06.6628236Z [666s] Generation 8 starting: 61 neighbors, 4 active search path(s) 2026-02-21T13:19:41.1228302Z [701s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:19:41.1247562Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 0.4 configs/s 2026-02-21T13:19:50.5502382Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 61/61 6.5 configs/s 2026-02-21T13:19:50.7340652Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 40/40 - configs/s 2026-02-21T13:19:56.4205748Z [716s] Generation 8 complete: 2026-02-21T13:19:56.4206165Z error=4 2026-02-21T13:19:56.4206378Z timeout=1 2026-02-21T13:19:56.4206596Z ok=60 2026-02-21T13:19:56.4206794Z min=4.8733 2026-02-21T13:19:56.4206993Z mid=10.6151 2026-02-21T13:19:56.4207194Z max=81.1952 2026-02-21T13:19:56.4207423Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:19:56.4208290Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:19:56.4208704Z 'l2_groupings': [8], 2026-02-21T13:19:56.4208981Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:19:56.4209297Z 'loop_orders': [[0, 1]], 
2026-02-21T13:19:56.4209590Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:19:56.4209858Z 'num_stages': 1, 2026-02-21T13:19:56.4210081Z 'num_warps': 4, 2026-02-21T13:19:56.4210310Z 'pid_type': 'flat', 2026-02-21T13:19:56.4210565Z 'range_flattens': [None, None], 2026-02-21T13:19:56.4210866Z 'range_multi_buffers': [None, False], 2026-02-21T13:19:56.4211180Z 'range_num_stages': [0, 1], 2026-02-21T13:19:56.4211630Z 'range_unroll_factors': [0, 1], 2026-02-21T13:19:56.4211922Z 'range_warp_specializes': [], 2026-02-21T13:19:56.4212204Z 'waves_per_eu': 2} 2026-02-21T13:19:56.4244283Z [716s] Fitting surrogate: 753 points, 753 targets 2026-02-21T13:19:56.9177907Z [717s] Generation 9 starting: 37 neighbors, 2 active search path(s) 2026-02-21T13:20:28.5842498Z [748s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:20:31.3808547Z [751s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[1, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:20:33.0081468Z [753s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, 
num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:20:33.0101190Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 0.4 configs/s 2026-02-21T13:20:39.8516657Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 38/38 5.5 configs/s 2026-02-21T13:20:39.9337274Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:20:42.4237043Z [762s] Generation 9 complete: 2026-02-21T13:20:42.4237769Z error=2 2026-02-21T13:20:42.4237988Z timeout=3 2026-02-21T13:20:42.4238180Z ok=34 2026-02-21T13:20:42.4238381Z min=4.9134 2026-02-21T13:20:42.4238589Z mid=10.6098 2026-02-21T13:20:42.4238814Z max=99.0830 2026-02-21T13:20:42.4239039Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:20:42.4239461Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:20:42.4239874Z 'l2_groupings': [8], 2026-02-21T13:20:42.4240154Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:20:42.4240488Z 'loop_orders': [[0, 1]], 2026-02-21T13:20:42.4240766Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:20:42.4241031Z 'num_stages': 1, 2026-02-21T13:20:42.4241255Z 'num_warps': 4, 2026-02-21T13:20:42.4241493Z 'pid_type': 'flat', 2026-02-21T13:20:42.4241752Z 'range_flattens': [None, None], 2026-02-21T13:20:42.4242070Z 'range_multi_buffers': [None, False], 2026-02-21T13:20:42.4242450Z 'range_num_stages': [0, 1], 2026-02-21T13:20:42.4242781Z 'range_unroll_factors': [0, 1], 2026-02-21T13:20:42.4243037Z 'range_warp_specializes': [], 2026-02-21T13:20:42.4243266Z 'waves_per_eu': 2} 2026-02-21T13:20:42.4271159Z [762s] Fitting surrogate: 792 points, 792 targets 2026-02-21T13:20:42.8800958Z [763s] Generation 10 starting: 33 neighbors, 2 active search path(s) 2026-02-21T13:21:00.2781264Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 1.4 configs/s 
2026-02-21T13:21:06.5214194Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 5.6 configs/s 2026-02-21T13:21:06.6024371Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:21:09.0120946Z [789s] Generation 10 complete: 2026-02-21T13:21:09.0124033Z ok=35 2026-02-21T13:21:09.0124690Z min=4.8251 2026-02-21T13:21:09.0124977Z mid=9.9433 2026-02-21T13:21:09.0125178Z max=78.2560 2026-02-21T13:21:09.0125772Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:21:09.0126221Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:21:09.0126661Z 'l2_groupings': [8], 2026-02-21T13:21:09.0126933Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:21:09.0127248Z 'loop_orders': [[0, 1]], 2026-02-21T13:21:09.0127521Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:21:09.0127800Z 'num_stages': 1, 2026-02-21T13:21:09.0128029Z 'num_warps': 4, 2026-02-21T13:21:09.0128263Z 'pid_type': 'flat', 2026-02-21T13:21:09.0128522Z 'range_flattens': [None, None], 2026-02-21T13:21:09.0128825Z 'range_multi_buffers': [None, False], 2026-02-21T13:21:09.0129149Z 'range_num_stages': [0, 1], 2026-02-21T13:21:09.0129422Z 'range_unroll_factors': [0, 1], 2026-02-21T13:21:09.0129714Z 'range_warp_specializes': [], 2026-02-21T13:21:09.0129987Z 'waves_per_eu': 2} 2026-02-21T13:21:09.0154025Z [789s] Fitting surrogate: 827 points, 827 targets 2026-02-21T13:21:09.5060350Z [789s] Generation 11 starting: 33 neighbors, 2 active search path(s) 2026-02-21T13:21:27.2315611Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 1.1 configs/s 2026-02-21T13:21:34.3838769Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 4.7 configs/s 2026-02-21T13:21:34.4990097Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:21:38.0424536Z [818s] Generation 11 complete: 2026-02-21T13:21:38.0425139Z ok=35 2026-02-21T13:21:38.0425396Z min=4.8449 2026-02-21T13:21:38.0425604Z mid=7.7070 
2026-02-21T13:21:38.0425774Z max=95.6146 2026-02-21T13:21:38.0425970Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:21:38.0426673Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:21:38.0427019Z 'l2_groupings': [8], 2026-02-21T13:21:38.0427245Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:21:38.0427507Z 'loop_orders': [[0, 1]], 2026-02-21T13:21:38.0427744Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:21:38.0427962Z 'num_stages': 1, 2026-02-21T13:21:38.0428148Z 'num_warps': 4, 2026-02-21T13:21:38.0428349Z 'pid_type': 'flat', 2026-02-21T13:21:38.0428568Z 'range_flattens': [None, None], 2026-02-21T13:21:38.0433023Z 'range_multi_buffers': [None, False], 2026-02-21T13:21:38.0433282Z 'range_num_stages': [0, 1], 2026-02-21T13:21:38.0433507Z 'range_unroll_factors': [0, 1], 2026-02-21T13:21:38.0433850Z 'range_warp_specializes': [], 2026-02-21T13:21:38.0434072Z 'waves_per_eu': 2} 2026-02-21T13:21:38.0459837Z [818s] Fitting surrogate: 862 points, 862 targets 2026-02-21T13:21:38.3505516Z [818s] Generation 12 starting: 17 neighbors, 1 active search path(s) 2026-02-21T13:21:56.9153357Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.8 configs/s 2026-02-21T13:22:00.7402166Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 4.4 configs/s 2026-02-21T13:22:00.8371681Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:22:03.8106994Z [843s] Generation 12 complete: 2026-02-21T13:22:03.8107464Z ok=19 2026-02-21T13:22:03.8108113Z min=5.0059 2026-02-21T13:22:03.8108331Z mid=5.4232 2026-02-21T13:22:03.8108533Z max=76.7894 2026-02-21T13:22:03.8108794Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:22:03.8109216Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:22:03.8109633Z 'l2_groupings': [8], 2026-02-21T13:22:03.8109921Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:22:03.8110242Z 'loop_orders': [[0, 1]], 
2026-02-21T13:22:03.8110516Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:22:03.8110790Z 'num_stages': 1, 2026-02-21T13:22:03.8111022Z 'num_warps': 4, 2026-02-21T13:22:03.8111263Z 'pid_type': 'flat', 2026-02-21T13:22:03.8111522Z 'range_flattens': [None, None], 2026-02-21T13:22:03.8111828Z 'range_multi_buffers': [None, False], 2026-02-21T13:22:03.8112143Z 'range_num_stages': [0, 1], 2026-02-21T13:22:03.8112422Z 'range_unroll_factors': [0, 1], 2026-02-21T13:22:03.8112715Z 'range_warp_specializes': [], 2026-02-21T13:22:03.8112979Z 'waves_per_eu': 2} 2026-02-21T13:22:03.8142801Z [843s] Fitting surrogate: 881 points, 881 targets 2026-02-21T13:22:04.1023915Z [844s] Generation 13 starting: 16 neighbors, 1 active search path(s) 2026-02-21T13:22:10.3247249Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.5 configs/s 2026-02-21T13:22:12.9072220Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 6.3 configs/s 2026-02-21T13:22:13.0024391Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:22:15.9683027Z [856s] Generation 13 complete: 2026-02-21T13:22:15.9683454Z ok=18 2026-02-21T13:22:15.9683664Z min=4.9615 2026-02-21T13:22:15.9683904Z mid=5.3730 2026-02-21T13:22:15.9684109Z max=72.5528 2026-02-21T13:22:15.9684351Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:22:15.9684772Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:22:15.9685187Z 'l2_groupings': [8], 2026-02-21T13:22:15.9685458Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:22:15.9685792Z 'loop_orders': [[0, 1]], 2026-02-21T13:22:15.9686068Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:22:15.9686327Z 'num_stages': 1, 2026-02-21T13:22:15.9686935Z 'num_warps': 4, 2026-02-21T13:22:15.9687168Z 'pid_type': 'flat', 2026-02-21T13:22:15.9687429Z 'range_flattens': [None, None], 2026-02-21T13:22:15.9687734Z 'range_multi_buffers': [None, False], 2026-02-21T13:22:15.9688054Z 'range_num_stages': [0, 1], 
2026-02-21T13:22:15.9688337Z 'range_unroll_factors': [0, 1], 2026-02-21T13:22:15.9688621Z 'range_warp_specializes': [], 2026-02-21T13:22:15.9688903Z 'waves_per_eu': 2} 2026-02-21T13:22:15.9717888Z [856s] Fitting surrogate: 899 points, 899 targets 2026-02-21T13:22:16.2501575Z [856s] Generation 14 starting: 16 neighbors, 1 active search path(s) 2026-02-21T13:22:24.5072850Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 1.9 configs/s 2026-02-21T13:22:27.0159118Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 6.5 configs/s 2026-02-21T13:22:27.0742385Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:22:28.7860155Z [868s] Generation 14 complete: 2026-02-21T13:22:28.7860485Z ok=18 2026-02-21T13:22:28.7860674Z min=4.9430 2026-02-21T13:22:28.7860868Z mid=10.0618 2026-02-21T13:22:28.7861051Z max=48.6261 2026-02-21T13:22:28.7861259Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:22:28.7861916Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:22:28.7862291Z 'l2_groupings': [8], 2026-02-21T13:22:28.7862543Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:22:28.7862828Z 'loop_orders': [[0, 1]], 2026-02-21T13:22:28.7863091Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:22:28.7863330Z 'num_stages': 1, 2026-02-21T13:22:28.7863541Z 'num_warps': 4, 2026-02-21T13:22:28.7863748Z 'pid_type': 'flat', 2026-02-21T13:22:28.7863984Z 'range_flattens': [None, None], 2026-02-21T13:22:28.7864257Z 'range_multi_buffers': [None, False], 2026-02-21T13:22:28.7864539Z 'range_num_stages': [0, 1], 2026-02-21T13:22:28.7864951Z 'range_unroll_factors': [0, 1], 2026-02-21T13:22:28.7865215Z 'range_warp_specializes': [], 2026-02-21T13:22:28.7865465Z 'waves_per_eu': 2} 2026-02-21T13:22:28.7894583Z [868s] Fitting surrogate: 917 points, 917 targets 2026-02-21T13:22:29.0815758Z [869s] Generation 15 starting: 16 neighbors, 1 active search path(s) 2026-02-21T13:22:35.8669563Z Generation 15: 
precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.2 configs/s 2026-02-21T13:22:38.4805606Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 6.2 configs/s 2026-02-21T13:22:38.5844539Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:22:41.7610292Z [881s] Generation 15 complete: 2026-02-21T13:22:41.7610759Z ok=18 2026-02-21T13:22:41.7610970Z min=4.9948 2026-02-21T13:22:41.7611181Z mid=5.3809 2026-02-21T13:22:41.7611378Z max=75.6484 2026-02-21T13:22:41.7611611Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:22:41.7612020Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:22:41.7612479Z 'l2_groupings': [8], 2026-02-21T13:22:41.7612751Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:22:41.7613083Z 'loop_orders': [[0, 1]], 2026-02-21T13:22:41.7613367Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:22:41.7613629Z 'num_stages': 1, 2026-02-21T13:22:41.7613853Z 'num_warps': 4, 2026-02-21T13:22:41.7614095Z 'pid_type': 'flat', 2026-02-21T13:22:41.7614355Z 'range_flattens': [None, None], 2026-02-21T13:22:41.7614654Z 'range_multi_buffers': [None, False], 2026-02-21T13:22:41.7614961Z 'range_num_stages': [0, 1], 2026-02-21T13:22:41.7615231Z 'range_unroll_factors': [0, 1], 2026-02-21T13:22:41.7615887Z 'range_warp_specializes': [], 2026-02-21T13:22:41.7616158Z 'waves_per_eu': 2} 2026-02-21T13:22:41.7648497Z [881s] Fitting surrogate: 935 points, 935 targets 2026-02-21T13:22:42.0479863Z [882s] Generation 16 starting: 16 neighbors, 1 active search path(s) 2026-02-21T13:22:47.7240657Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.0 configs/s 2026-02-21T13:22:49.7441446Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 8.3 configs/s 2026-02-21T13:22:49.8085401Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:22:51.7640750Z [891s] Generation 16 complete: 2026-02-21T13:22:51.7641136Z ok=18 
2026-02-21T13:22:51.7641339Z min=4.9749 2026-02-21T13:22:51.7641588Z mid=7.2286 2026-02-21T13:22:51.7641792Z max=16.2258 2026-02-21T13:22:51.7642030Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:22:51.7642452Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:22:51.7643021Z 'l2_groupings': [8], 2026-02-21T13:22:51.7643298Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:22:51.7643616Z 'loop_orders': [[0, 1]], 2026-02-21T13:22:51.7643887Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:22:51.7644145Z 'num_stages': 1, 2026-02-21T13:22:51.7644377Z 'num_warps': 4, 2026-02-21T13:22:51.7644605Z 'pid_type': 'flat', 2026-02-21T13:22:51.7644872Z 'range_flattens': [None, None], 2026-02-21T13:22:51.7645174Z 'range_multi_buffers': [None, False], 2026-02-21T13:22:51.7645482Z 'range_num_stages': [0, 1], 2026-02-21T13:22:51.7645780Z 'range_unroll_factors': [0, 1], 2026-02-21T13:22:51.7646071Z 'range_warp_specializes': [], 2026-02-21T13:22:51.7646377Z 'waves_per_eu': 2} 2026-02-21T13:22:51.7677599Z [891s] Fitting surrogate: 953 points, 953 targets 2026-02-21T13:22:52.0570336Z [892s] Generation 17 starting: 16 neighbors, 1 active search path(s) 2026-02-21T13:22:59.2733103Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.3 configs/s 2026-02-21T13:23:02.0995523Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 5.7 configs/s 2026-02-21T13:23:02.2076930Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:23:05.5595055Z [905s] Generation 17 complete: 2026-02-21T13:23:05.5595415Z ok=18 2026-02-21T13:23:05.5595623Z min=4.9209 2026-02-21T13:23:05.5595885Z mid=5.3017 2026-02-21T13:23:05.5596087Z max=76.3154 2026-02-21T13:23:05.5596819Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:23:05.5597247Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:23:05.5597684Z 'l2_groupings': [8], 2026-02-21T13:23:05.5597963Z 'load_eviction_policies': ['', '', 
''], 2026-02-21T13:23:05.5598292Z 'loop_orders': [[0, 1]], 2026-02-21T13:23:05.5598589Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:23:05.5598854Z 'num_stages': 1, 2026-02-21T13:23:05.5599094Z 'num_warps': 4, 2026-02-21T13:23:05.5599327Z 'pid_type': 'flat', 2026-02-21T13:23:05.5599590Z 'range_flattens': [None, None], 2026-02-21T13:23:05.5599908Z 'range_multi_buffers': [None, False], 2026-02-21T13:23:05.5600227Z 'range_num_stages': [0, 1], 2026-02-21T13:23:05.5600512Z 'range_unroll_factors': [0, 1], 2026-02-21T13:23:05.5600811Z 'range_warp_specializes': [], 2026-02-21T13:23:05.5601088Z 'waves_per_eu': 2} 2026-02-21T13:23:05.5634519Z [905s] Fitting surrogate: 971 points, 971 targets 2026-02-21T13:23:05.8322106Z [905s] Generation 18 starting: 15 neighbors, 1 active search path(s) 2026-02-21T13:23:16.7857294Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 0.5 configs/s 2026-02-21T13:23:18.8101768Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 7.7 configs/s 2026-02-21T13:23:18.8854434Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:23:21.1895040Z [921s] Generation 18 complete: 2026-02-21T13:23:21.1895465Z ok=17 2026-02-21T13:23:21.1895730Z min=4.9914 2026-02-21T13:23:21.1895947Z mid=6.2518 2026-02-21T13:23:21.1896152Z max=19.8900 2026-02-21T13:23:21.1896707Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:23:21.1897136Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:23:21.1897548Z 'l2_groupings': [8], 2026-02-21T13:23:21.1897837Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:23:21.1898158Z 'loop_orders': [[0, 1]], 2026-02-21T13:23:21.1898434Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:23:21.1898720Z 'num_stages': 1, 2026-02-21T13:23:21.1898950Z 'num_warps': 4, 2026-02-21T13:23:21.1899186Z 'pid_type': 'flat', 2026-02-21T13:23:21.1899592Z 'range_flattens': [None, None], 2026-02-21T13:23:21.1899904Z 'range_multi_buffers': [None, False], 
2026-02-21T13:23:21.1900214Z 'range_num_stages': [0, 1], 2026-02-21T13:23:21.1900675Z 'range_unroll_factors': [0, 1], 2026-02-21T13:23:21.1900973Z 'range_warp_specializes': [], 2026-02-21T13:23:21.1901262Z 'waves_per_eu': 2} 2026-02-21T13:23:21.1928979Z [921s] Fitting surrogate: 988 points, 988 targets 2026-02-21T13:23:21.4725461Z [921s] Generation 19 starting: 15 neighbors, 1 active search path(s) 2026-02-21T13:23:27.8737019Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 2.3 configs/s 2026-02-21T13:23:30.1402534Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 6.8 configs/s 2026-02-21T13:23:30.1927649Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:23:31.7391240Z [931s] Generation 19 complete: 2026-02-21T13:23:31.7391615Z ok=17 2026-02-21T13:23:31.7391739Z min=4.9294 2026-02-21T13:23:31.7391884Z mid=11.1025 2026-02-21T13:23:31.7392061Z max=19.9553 2026-02-21T13:23:31.7392235Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:23:31.7392561Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:23:31.7392891Z 'l2_groupings': [8], 2026-02-21T13:23:31.7393100Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:23:31.7393343Z 'loop_orders': [[0, 1]], 2026-02-21T13:23:31.7393556Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:23:31.7393756Z 'num_stages': 1, 2026-02-21T13:23:31.7393949Z 'num_warps': 4, 2026-02-21T13:23:31.7394130Z 'pid_type': 'flat', 2026-02-21T13:23:31.7394330Z 'range_flattens': [None, None], 2026-02-21T13:23:31.7394569Z 'range_multi_buffers': [None, False], 2026-02-21T13:23:31.7394804Z 'range_num_stages': [0, 1], 2026-02-21T13:23:31.7395029Z 'range_unroll_factors': [0, 1], 2026-02-21T13:23:31.7395249Z 'range_warp_specializes': [], 2026-02-21T13:23:31.7395786Z 'waves_per_eu': 2} 2026-02-21T13:23:31.7427969Z [931s] Fitting surrogate: 1005 points, 1005 targets 2026-02-21T13:23:32.0341152Z [932s] Generation 20 starting: 17 neighbors, 1 active search 
path(s) 2026-02-21T13:23:39.1135769Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 3.2 configs/s 2026-02-21T13:23:41.7786939Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 6.5 configs/s 2026-02-21T13:23:41.8542441Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s 2026-02-21T13:23:44.1736136Z [944s] Generation 20 complete: 2026-02-21T13:23:44.1736596Z error=1 2026-02-21T13:23:44.1736800Z ok=18 2026-02-21T13:23:44.1737006Z min=4.9479 2026-02-21T13:23:44.1737217Z mid=7.1990 2026-02-21T13:23:44.1737415Z max=75.5786 2026-02-21T13:23:44.1737650Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:23:44.1738066Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:23:44.1738482Z 'l2_groupings': [8], 2026-02-21T13:23:44.1738766Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:23:44.1739084Z 'loop_orders': [[0, 1]], 2026-02-21T13:23:44.1739683Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:23:44.1739948Z 'num_stages': 1, 2026-02-21T13:23:44.1740176Z 'num_warps': 4, 2026-02-21T13:23:44.1740412Z 'pid_type': 'flat', 2026-02-21T13:23:44.1740684Z 'range_flattens': [None, None], 2026-02-21T13:23:44.1740982Z 'range_multi_buffers': [None, False], 2026-02-21T13:23:44.1741296Z 'range_num_stages': [0, 1], 2026-02-21T13:23:44.1741579Z 'range_unroll_factors': [0, 1], 2026-02-21T13:23:44.1741807Z 'range_warp_specializes': [], 2026-02-21T13:23:44.1742068Z 'waves_per_eu': 2} 2026-02-21T13:23:44.1772455Z [944s] Fitting surrogate: 1024 points, 1024 targets 2026-02-21T13:23:44.3047828Z [944s] Autotuning complete in 944.5s after searching 938 configs. 
2026-02-21T13:23:44.3048050Z One can hardcode the best config and skip autotuning with: 2026-02-21T13:23:44.3048929Z @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T13:23:44.3049572Z 2026-02-21T13:23:44.3049757Z [944s] Code of selected kernel: /tmp/torchinductor_root/ue/cuel5li36jx6visw4dnehr6crgpmvnmtejhc4lconh7ghkb7a2du.py 2026-02-21T13:23:44.3282774Z from __future__ import annotations 2026-02-21T13:23:44.3283134Z 2026-02-21T13:23:44.3283453Z import torch 2026-02-21T13:23:44.3283738Z import triton 2026-02-21T13:23:44.3284041Z import triton.language as tl 2026-02-21T13:23:44.3284412Z from torch._inductor.runtime import triton_helpers 2026-02-21T13:23:44.3284892Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T13:23:44.3285754Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T13:23:44.3286072Z 2026-02-21T13:23:44.3286196Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T13:23:44.3286510Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T13:23:44.3286806Z _SHAPE_DIM = tl.constexpr(128) 2026-02-21T13:23:44.3287115Z _BLOCK_SIZE_3 = tl.constexpr(128) 2026-02-21T13:23:44.3287421Z _SHAPE_DIM_4 = tl.constexpr(128) 2026-02-21T13:23:44.3287710Z _SHAPE_DIM_5 = tl.constexpr(128) 2026-02-21T13:23:44.3287901Z 2026-02-21T13:23:44.3287989Z @triton.jit 2026-02-21T13:23:44.3288368Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr): 2026-02-21T13:23:44.3288993Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T13:23:44.3289443Z num_pid_m = 192 
2026-02-21T13:23:44.3289725Z num_pid_n = tl.cdiv(4096, _BLOCK_SIZE_1) 2026-02-21T13:23:44.3289960Z inner_2d_pid = tl.program_id(0) 2026-02-21T13:23:44.3290130Z num_pid_in_group = 8 * num_pid_n 2026-02-21T13:23:44.3290309Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T13:23:44.3290494Z first_pid_m = group_id * 8 2026-02-21T13:23:44.3290671Z group_size_m = min(num_pid_m - first_pid_m, 8) 2026-02-21T13:23:44.3290907Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T13:23:44.3291165Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T13:23:44.3291360Z offset_0 = pid_0 2026-02-21T13:23:44.3291513Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T13:23:44.3291711Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T13:23:44.3291887Z indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32) 2026-02-21T13:23:44.3292159Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T13:23:44.3292455Z m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32) 2026-02-21T13:23:44.3292706Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T13:23:44.3292945Z l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32) 2026-02-21T13:23:44.3293210Z # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32) 2026-02-21T13:23:44.3293578Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32) 2026-02-21T13:23:44.3293812Z # src[attention.py:71]: q = q_view[tile_b, tile_m, :] 2026-02-21T13:23:44.3294264Z q = tl.load(tl.make_block_ptr(q_view, [192, 4096, 128], [524288, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T13:23:44.3294796Z # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)): 2026-02-21T13:23:44.3295044Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T13:23:44.3295252Z # 
src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T13:23:44.3295424Z # src[attention.py:72-85]: ... 2026-02-21T13:23:44.3295725Z for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_3, loop_unroll_factor=1, num_stages=1, disallow_acc_multi_buffer=True): 2026-02-21T13:23:44.3296077Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32) 2026-02-21T13:23:44.3296277Z q_copy = q 2026-02-21T13:23:44.3296393Z m_i_copy = m_i 2026-02-21T13:23:44.3296518Z l_i_copy = l_i 2026-02-21T13:23:44.3296670Z acc_copy = acc 2026-02-21T13:23:44.3296796Z q_copy_0 = q_copy 2026-02-21T13:23:44.3296933Z m_i_copy_0 = m_i_copy 2026-02-21T13:23:44.3297067Z l_i_copy_0 = l_i_copy 2026-02-21T13:23:44.3297207Z acc_copy_0 = acc_copy 2026-02-21T13:23:44.3297374Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T13:23:44.3297717Z k = tl.load(k_view + (indices_0[:, None, None] * 524288 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T13:23:44.3298044Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T13:23:44.3298664Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T13:23:44.3299361Z # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale) 2026-02-21T13:23:44.3299613Z amax = tl.cast(tl.max(qk, 2), tl.bfloat16) 2026-02-21T13:23:44.3299787Z v_0 = 0.12751743074602467 2026-02-21T13:23:44.3299986Z v_1 = tl.cast(amax * v_0, tl.bfloat16) 2026-02-21T13:23:44.3300120Z v_2 = tl.cast(v_1, tl.float32) 2026-02-21T13:23:44.3300255Z v_3 = triton_helpers.maximum(m_i_copy_0, v_2) 2026-02-21T13:23:44.3300430Z # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None] 2026-02-21T13:23:44.3300583Z v_4 = 0.12751743074602467 2026-02-21T13:23:44.3300700Z v_5 = tl.cast(qk * v_4, tl.bfloat16) 
2026-02-21T13:23:44.3300826Z subscript = v_3[:, :, None] 2026-02-21T13:23:44.3300951Z v_6 = tl.cast(v_5, tl.float32) 2026-02-21T13:23:44.3301067Z v_7 = v_6 - subscript 2026-02-21T13:23:44.3301192Z # src[attention.py:77]: p = torch.exp2(qk) 2026-02-21T13:23:44.3301327Z v_8 = libdevice.exp2(v_7) 2026-02-21T13:23:44.3301461Z # src[attention.py:78]: l_ij = torch.sum(p, -1) 2026-02-21T13:23:44.3301613Z l_ij = tl.cast(tl.sum(v_8, 2), tl.float32) 2026-02-21T13:23:44.3301769Z # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij) 2026-02-21T13:23:44.3301919Z v_9 = m_i_copy_0 - v_3 2026-02-21T13:23:44.3302030Z v_10 = libdevice.exp2(v_9) 2026-02-21T13:23:44.3302169Z # src[attention.py:80]: l_i = l_i * alpha + l_ij 2026-02-21T13:23:44.3302307Z v_11 = l_i_copy_0 * v_10 2026-02-21T13:23:44.3302421Z l_i = v_11 + l_ij 2026-02-21T13:23:44.3302548Z # src[attention.py:81]: acc = acc * alpha[:, :, None] 2026-02-21T13:23:44.3302694Z subscript_1 = v_10[:, :, None] 2026-02-21T13:23:44.3302819Z v_13 = acc_copy_0 * subscript_1 2026-02-21T13:23:44.3302965Z # src[attention.py:82]: v = v_view[tile_b, tile_n, :] 2026-02-21T13:23:44.3303345Z v = tl.load(tl.make_block_ptr(v_view, [192, 4096, 128], [524288, 128, 1], [offset_0, offset_2, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_3, _SHAPE_DIM_4], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero') 2026-02-21T13:23:44.3303687Z # src[attention.py:83]: p = p.to(v.dtype) 2026-02-21T13:23:44.3303824Z v_14 = tl.cast(v_8, tl.bfloat16) 2026-02-21T13:23:44.3303976Z # src[attention.py:84]: acc = torch.baddbmm(acc, p, v) 2026-02-21T13:23:44.3304480Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T13:23:44.3304942Z # src[attention.py:85]: m_i = m_ij 2026-02-21T13:23:44.3305065Z m_i = v_3 2026-02-21T13:23:44.3305179Z # 
src[attention.py:87]: acc = acc / l_i[:, :, None] 2026-02-21T13:23:44.3305322Z subscript_2 = l_i[:, :, None] 2026-02-21T13:23:44.3305438Z v_15 = acc / subscript_2 2026-02-21T13:23:44.3305587Z # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype) 2026-02-21T13:23:44.3305767Z v_16 = tl.cast(v_15, tl.bfloat16) 2026-02-21T13:23:44.3306071Z tl.store(tl.make_block_ptr(out, [192, 4096, 128], [524288, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM_5], [2, 1, 0]), v_16, boundary_check=[0, 1, 2]) 2026-02-21T13:23:44.3306335Z 2026-02-21T13:23:44.3306480Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T13:23:44.3306693Z """ 2026-02-21T13:23:44.3306791Z Computes scaled dot-product attention. 2026-02-21T13:23:44.3306881Z 2026-02-21T13:23:44.3307016Z Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V 2026-02-21T13:23:44.3307183Z 2026-02-21T13:23:44.3307216Z Args: 2026-02-21T13:23:44.3307333Z q_in: Query tensor of shape [..., seq_len_q, head_dim] 2026-02-21T13:23:44.3307504Z k_in: Key tensor of shape [..., seq_len_k, head_dim] 2026-02-21T13:23:44.3307674Z v_in: Value tensor of shape [..., seq_len_k, head_dim] 2026-02-21T13:23:44.3307778Z 2026-02-21T13:23:44.3307815Z Returns: 2026-02-21T13:23:44.3307927Z Output tensor of shape [..., seq_len_q, head_dim] 2026-02-21T13:23:44.3308061Z """ 2026-02-21T13:23:44.3308160Z # src[attention.py:56]: m_dim = q_in.size(-2) 2026-02-21T13:23:44.3308292Z m_dim = q_in.size(-2) 2026-02-21T13:23:44.3308413Z # src[attention.py:57]: n_dim = k_in.size(-2) 2026-02-21T13:23:44.3308545Z n_dim = k_in.size(-2) 2026-02-21T13:23:44.3308673Z # src[attention.py:58]: assert n_dim == v_in.size(-2) 2026-02-21T13:23:44.3308815Z assert n_dim == v_in.size(-2) 2026-02-21T13:23:44.3308973Z # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1)) 2026-02-21T13:23:44.3309130Z head_dim = 128 
2026-02-21T13:23:44.3309269Z # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T13:23:44.3309460Z assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T13:23:44.3309641Z # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T13:23:44.3309805Z q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T13:23:44.3309962Z # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T13:23:44.3310117Z v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T13:23:44.3310296Z # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T13:23:44.3310492Z k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T13:23:44.3310654Z # src[attention.py:64]: out = torch.empty_like(q_view) 2026-02-21T13:23:44.3310791Z out = torch.empty_like(q_view) 2026-02-21T13:23:44.3310947Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T13:23:44.3311129Z _BLOCK_SIZE_1 = 128 2026-02-21T13:23:44.3311221Z _RDIM_SIZE_2 = 128 2026-02-21T13:23:44.3311363Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T13:23:44.3311587Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T13:23:44.3311797Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T13:23:44.3311939Z # src[attention.py:67-88]: ... 2026-02-21T13:23:44.3312251Z _launcher(_helion_attention, (192 * triton.cdiv(4096, _BLOCK_SIZE_1),), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=1, waves_per_eu=2, matrix_instr_nonkdim=0) 2026-02-21T13:23:44.3312574Z # src[attention.py:89]: return out.view(q_in.size()) 2026-02-21T13:23:44.3312707Z return out.view(q_in.size()) 2026-02-21T13:23:45.3577158Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T13:23:45.3579523Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T13:23:45.3581502Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T13:23:45.3581969Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T13:23:45.3582389Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T13:23:45.3582718Z ------------------------------------------ 2026-02-21T13:23:45.3583022Z (4, 48, 4096, 4096, 128) 2026-02-21T13:23:45.3583203Z 2026-02-21T13:23:45.3584067Z 83%|████████▎ | 5/6 [56:51<13:17, 797.55s/it]WARNING:tritonbench.utils.triton_op:Running input ID 6: 2026-02-21T13:23:45.3584591Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T13:23:45.3584907Z ------------------------------------------ 2026-02-21T13:23:45.3585205Z (4, 48, 8192, 8192, 128) 2026-02-21T13:23:45.3590234Z INFO:tritonbench.utils.triton_op:Took 0.06ms to get benchmark function for aten 2026-02-21T13:23:47.2369488Z INFO:tritonbench.utils.triton_op:Took 1.44ms to get benchmark function for flex_attention 2026-02-21T13:23:48.7369926Z WARNING:__main__:Input tensor metadata: 2026-02-21T13:23:48.7370377Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T13:23:48.7370705Z 'dtype': 'torch.bfloat16', 2026-02-21T13:23:48.7371028Z 'shape': (4, 48, 8192, 128), 2026-02-21T13:23:48.7371357Z 'stride': (50331648, 1048576, 128, 1)}, 2026-02-21T13:23:48.7371689Z { 'device': 'cuda:0', 2026-02-21T13:23:48.7371992Z 'dtype': 'torch.bfloat16', 2026-02-21T13:23:48.7372301Z 'shape': (4, 48, 8192, 128), 
2026-02-21T13:23:48.7372615Z 'stride': (50331648, 1048576, 128, 1)}, 2026-02-21T13:23:48.7372959Z { 'device': 'cuda:0', 2026-02-21T13:23:48.7373242Z 'dtype': 'torch.bfloat16', 2026-02-21T13:23:48.7373547Z 'shape': (4, 48, 8192, 128), 2026-02-21T13:23:48.7373873Z 'stride': (50331648, 1048576, 128, 1)}), 2026-02-21T13:23:48.7374184Z 'kwargs': {}} 2026-02-21T13:23:48.7415028Z INFO:tritonbench.utils.triton_op:Took 4.86ms to get benchmark function for helion_attention 2026-02-21T13:23:48.9863428Z [0s] Autotune random seed: 2150287535 2026-02-21T13:23:49.1214909Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T13:24:21.6519037Z [32s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:24:24.1257609Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 4096, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:24.4183187Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_blocked', 
range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[3, 3], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:24:26.5105472Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 2, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:24:27.6041856Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 32, 2048], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[3, 3], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:24:28.7854041Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:24:29.7400521Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, 
pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:31.4641828Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 128, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:33.1068618Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 64, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:34.6430573Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:24:35.1492342Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], 
range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:24:35.8291296Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[1, 0], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:24:36.3053733Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 4, 2048], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:36.5686881Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:37.2166680Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 16, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], 
range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:24:37.4821886Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:37.7443764Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 256, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:24:40.2872079Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 8, 2048], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:24:40.5592823Z [51s] Timeout after 31s compiling Config(block_sizes=[1, 8, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:24:42.4304912Z 
[53s] Timeout after 30s compiling Config(block_sizes=[1, 4096, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:24:43.7679023Z [54s] Timeout after 30s compiling Config(block_sizes=[1, 2, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:24:44.0559501Z [54s] Timeout after 30s compiling Config(block_sizes=[1, 32, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:24:44.3558474Z [55s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4) 2026-02-21T13:24:44.3591704Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.1 configs/s 
2026-02-21T13:28:44.9811843Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:93:135: error: 'tt.load' op operation destroyed but still has uses 2026-02-21T13:28:44.9812729Z v = tl.load(v_view + (indices_0[:, None, None] * 1048576 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T13:28:44.9813266Z ^ 2026-02-21T13:28:44.9814749Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:97:144: note: - use: %122 = "tt.reshape"(<>) : (tensor<1x4096x128xbf16, #ttg.blocked<{sizePerThread = [1, 1, 8], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 2, 1], order = [2, 0, 1]}>>) -> tensor<4096x128xbf16, #ttg.linear<{register = [[0, 1], [0, 2], [0, 4], [8, 0], [16, 0], [32, 0], [64, 0], [128, 0], [256, 0], [512, 0], [1024, 0], [2048, 0]], lane = [[0, 8], [0, 16], [0, 32], [0, 64], [1, 0], [2, 0]], warp = [[4, 0]], block = []}>> 2026-02-21T13:28:44.9816051Z 2026-02-21T13:28:44.9816750Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T13:28:44.9817966Z ^ 2026-02-21T13:28:44.9818323Z LLVM ERROR: operation destroyed but still has uses 2026-02-21T13:28:44.9840748Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}> 2026-02-21T13:28:44.9841395Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}> 2026-02-21T13:28:44.9841847Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T13:28:44.9842289Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 
2026-02-21T13:28:44.9842790Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T13:28:44.9843200Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T13:28:44.9843735Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [0, 1, 2]}> 2026-02-21T13:28:44.9844202Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [0, 1, 2]}> 2026-02-21T13:28:44.9844648Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T13:28:44.9845104Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}> 2026-02-21T13:28:44.9845687Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}> 2026-02-21T13:28:44.9846191Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T13:28:44.9846957Z tt.func public @_helion_attention(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T13:28:44.9847555Z %c128_i32 = arith.constant 128 : i32 2026-02-21T13:28:44.9847746Z %c1048576_i32 = arith.constant 1048576 : i32 2026-02-21T13:28:44.9847936Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T13:28:44.9848116Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T13:28:44.9848280Z %c0_i32 = arith.constant 0 : i32 2026-02-21T13:28:44.9848462Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T13:28:44.9848638Z %c1_i32 = arith.constant 1 : i32 2026-02-21T13:28:44.9848852Z %cst = 
arith.constant dense<128> : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9849149Z %cst_0 = arith.constant dense<0.127517432> : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9849453Z %cst_1 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9849747Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:44.9850025Z %cst_3 = arith.constant dense<128> : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9850318Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9850605Z %cst_5 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9850887Z %cst_6 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9851120Z %c16_i32 = arith.constant 16 : i32 2026-02-21T13:28:44.9851294Z %c192_i32 = arith.constant 192 : i32 2026-02-21T13:28:44.9851472Z %c1572864_i32 = arith.constant 1572864 : i32 2026-02-21T13:28:44.9851653Z %c324_i32 = arith.constant 324 : i32 2026-02-21T13:28:44.9851821Z %0 = tt.get_program_id x : i32 2026-02-21T13:28:44.9852026Z %1 = arith.muli %0, %c324_i32 : i32 2026-02-21T13:28:44.9852191Z %2 = arith.addi %1, %c324_i32 : i32 2026-02-21T13:28:44.9852370Z %3 = arith.minsi %2, %c1572864_i32 : i32 2026-02-21T13:28:44.9852620Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4> 2026-02-21T13:28:44.9853020Z %5 = ttg.convert_layout %4 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:44.9853542Z %6 = tt.expand_dims %5 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x128xi32, #blocked5> 2026-02-21T13:28:44.9853992Z %7 = ttg.convert_layout %6 : tensor<1x128xi32, #blocked5> -> tensor<1x128xi32, #blocked3> 2026-02-21T13:28:44.9854435Z %8 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 
2026-02-21T13:28:44.9854830Z %9 = tt.expand_dims %8 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x128xi32, #blocked6> 2026-02-21T13:28:44.9855183Z %10 = ttg.convert_layout %9 : tensor<1x1x128xi32, #blocked6> -> tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9855476Z %11 = tt.splat %arg0 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9855723Z %12 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9856046Z %13 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:44.9856435Z %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x128x1xi32, #blocked7> 2026-02-21T13:28:44.9856790Z %15 = ttg.convert_layout %14 : tensor<1x128x1xi32, #blocked7> -> tensor<1x128x1xi32, #blocked> 2026-02-21T13:28:44.9857087Z %16 = tt.splat %arg1 : !tt.ptr -> tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9857420Z %17 = tt.broadcast %10 : tensor<1x1x128xi32, #blocked1> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9857696Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9857935Z %19 = tt.splat %arg3 : !tt.ptr -> tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9858126Z %20 = arith.subi %3, %1 : i32 2026-02-21T13:28:44.9858265Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T13:28:44.9858413Z %21 = arith.subi %c1_i32, %c1_i32_7 : i32 2026-02-21T13:28:44.9858550Z %22 = arith.addi %20, %21 : i32 2026-02-21T13:28:44.9858681Z %23 = arith.divui %22, %c1_i32 : i32 2026-02-21T13:28:44.9858820Z %c2_i32 = arith.constant 2 : i32 2026-02-21T13:28:44.9858947Z %24 = arith.remsi %23, %c2_i32 : i32 2026-02-21T13:28:44.9859083Z %25 = arith.subi %23, %24 : i32 2026-02-21T13:28:44.9859209Z %26 = arith.muli %25, %c1_i32 : i32 2026-02-21T13:28:44.9859345Z %27 = arith.addi %1, %26 : 
i32 2026-02-21T13:28:44.9859478Z %28 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T13:28:44.9859627Z scf.for %arg4 = %1 to %27 step %28 : i32 { 2026-02-21T13:28:44.9859791Z %29 = arith.divsi %arg4, %c131072_i32 : i32 2026-02-21T13:28:44.9859935Z %30 = arith.muli %29, %c16_i32 : i32 2026-02-21T13:28:44.9860071Z %31 = arith.subi %c192_i32, %30 : i32 2026-02-21T13:28:44.9860206Z %32 = arith.minsi %31, %c16_i32 : i32 2026-02-21T13:28:44.9860347Z %33 = arith.remsi %arg4, %c131072_i32 : i32 2026-02-21T13:28:44.9860484Z %34 = arith.remsi %33, %32 : i32 2026-02-21T13:28:44.9860620Z %35 = arith.addi %30, %34 : i32 2026-02-21T13:28:44.9860744Z %36 = arith.divsi %33, %32 : i32 2026-02-21T13:28:44.9860883Z %37 = arith.muli %35, %c1048576_i32 : i32 2026-02-21T13:28:44.9861027Z %38 = arith.muli %36, %c128_i32 : i32 2026-02-21T13:28:44.9861154Z %39 = arith.addi %37, %38 : i32 2026-02-21T13:28:44.9861321Z %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9861532Z %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9861786Z %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9862030Z %43 = tt.load %42 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9862218Z %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked> 2026-02-21T13:28:44.9862408Z %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked> 2026-02-21T13:28:44.9862661Z %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x4096xi32, #blocked> 2026-02-21T13:28:44.9862943Z %47 = ttg.convert_layout %46 : tensor<1x128x4096xi32, #blocked> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9863194Z %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3> 2026-02-21T13:28:44.9863387Z %49 = tt.splat %37 : i32 -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9863527Z %c0_i32_8 = arith.constant 0 : i32 2026-02-21T13:28:44.9863648Z %c16384_i32 = 
arith.constant 16384 : i32 2026-02-21T13:28:44.9864002Z %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c16384_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T13:28:44.9864352Z %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9864533Z %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9864772Z %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:44.9865126Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:44.9865421Z %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:44.9865722Z %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:44.9866070Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:44.9866380Z %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9866601Z %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9866811Z %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9867026Z %103 = arith.addi %47, %102 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9867253Z %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9867476Z %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9867693Z %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 
2026-02-21T13:28:44.9867996Z %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:44.9868354Z %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:44.9868672Z %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9869089Z %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9869497Z %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:44.9869759Z %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9870002Z %113 = arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9870253Z %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9870444Z %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9870586Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9870706Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:44.9870833Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9871027Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9871319Z %116 = ttg.convert_layout %115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9871598Z %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9871844Z %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9872038Z %119 
= arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9872230Z %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9872444Z %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9872642Z %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9872813Z %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9872982Z %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:44.9873188Z %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9873398Z %126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9873609Z %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9873896Z %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9874231Z %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9874531Z %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9874776Z %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9875025Z %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:44.9875280Z %133 = ttg.convert_layout %132 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9875498Z %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9875813Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, 
#blocked1> 2026-02-21T13:28:44.9876107Z %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9876233Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9876351Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:44.9876477Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9876663Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9876958Z %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9877201Z %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9877504Z %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9877792Z %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9877949Z %141 = arith.addf %140, %137 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9878191Z %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9878541Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9878840Z %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9879090Z %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:44.9879341Z %146 = ttg.convert_layout %145 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9879556Z %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9879821Z %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:44.9880169Z %149 = tt.expand_dims 
%148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:44.9880479Z %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9880692Z %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9880853Z %152 = arith.addi %49, %151 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9881072Z %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:44.9881333Z %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9881558Z %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9881785Z %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9882014Z %157 = tt.load %156 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9882227Z %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9882467Z %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9882750Z %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:44.9882996Z %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:44.9883307Z %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:44.9883675Z %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:44.9883977Z %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9884386Z %165 = tt.dot %162, %163, %164, 
inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9884784Z %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9884964Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T13:28:44.9885095Z %167 = arith.muli %c4096_i32, %c1_i32_12 : i32 2026-02-21T13:28:44.9885242Z %168 = arith.addi %arg5, %167 : i32 2026-02-21T13:28:44.9885381Z %169 = tt.splat %168 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9885539Z %170 = arith.addi %169, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9885784Z %171 = ttg.convert_layout %170 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:44.9886130Z %172 = tt.expand_dims %171 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:44.9886446Z %173 = ttg.convert_layout %172 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:44.9886740Z %174 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:44.9887091Z %175 = tt.expand_dims %174 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:44.9887404Z %176 = ttg.convert_layout %175 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9887646Z %177 = arith.muli %176, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9887853Z %178 = tt.broadcast %177 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9888064Z %179 = arith.addi %47, %178 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9888288Z %180 = tt.addptr %16, %179 : tensor<1x128x4096x!tt.ptr, 
#blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9888514Z %181 = tt.load %180 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9888726Z %182 = tt.reshape %181 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:44.9889049Z %183 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:44.9889408Z %184 = ttg.convert_layout %182 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:44.9889721Z %185 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9890126Z %186 = tt.dot %183, %184, %185, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9890536Z %187 = ttg.convert_layout %186 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:44.9890780Z %188 = tt.reshape %187 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9891023Z %189 = arith.truncf %188 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9891272Z %190 = arith.extf %189 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9891461Z %191 = "tt.reduce"(%190) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9891588Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9891708Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:44.9891833Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9892022Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9892314Z %192 = ttg.convert_layout %191 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 
2026-02-21T13:28:44.9892584Z %193 = arith.truncf %192 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9892804Z %194 = arith.extf %193 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9892997Z %195 = arith.mulf %194, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9893203Z %196 = arith.truncf %195 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9893418Z %197 = arith.extf %196 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9893615Z %198 = arith.cmpf ogt, %125, %197 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9893779Z %199 = arith.cmpf une, %125, %125 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9893952Z %200 = arith.ori %198, %199 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:44.9894142Z %201 = arith.select %200, %125, %197 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9894345Z %202 = arith.mulf %190, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9894556Z %203 = arith.truncf %202 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9894846Z %204 = ttg.convert_layout %201 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9895183Z %205 = tt.expand_dims %204 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9895490Z %206 = ttg.convert_layout %205 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9895734Z %207 = arith.extf %203 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9895980Z %208 = tt.broadcast %206 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:44.9896235Z %209 = ttg.convert_layout %208 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9896453Z %210 
= arith.subf %207, %209 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9896779Z %211 = tt.extern_elementwise %210 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9897070Z %212 = "tt.reduce"(%211) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9897197Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9897314Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:44.9897436Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9897620Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9897914Z %213 = ttg.convert_layout %212 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9898155Z %214 = arith.subf %125, %201 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9898437Z %215 = tt.extern_elementwise %214 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9898722Z %216 = arith.mulf %141, %215 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9898880Z %217 = arith.addf %216, %213 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9899119Z %218 = ttg.convert_layout %215 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9899450Z %219 = tt.expand_dims %218 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9899742Z %220 = ttg.convert_layout %219 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9899991Z %221 = tt.broadcast %220 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:44.9900242Z %222 = ttg.convert_layout %221 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9900457Z %223 = arith.mulf 
%166, %222 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9900707Z %224 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:44.9901071Z %225 = tt.expand_dims %224 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:44.9901383Z %226 = ttg.convert_layout %225 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9901596Z %227 = arith.muli %226, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9901781Z %228 = arith.addi %49, %227 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9901983Z %229 = tt.broadcast %228 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:44.9902242Z %230 = ttg.convert_layout %229 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9902469Z %231 = arith.addi %230, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9902691Z %232 = tt.addptr %18, %231 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9902919Z %233 = tt.load %232 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9903148Z %234 = arith.truncf %211 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9903388Z %235 = tt.reshape %223 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9903625Z %236 = tt.reshape %234 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:44.9903871Z %237 = tt.reshape %233 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:44.9904174Z %238 = ttg.convert_layout %236 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:44.9916563Z %239 = ttg.convert_layout %237 : tensor<4096x128xbf16, #blocked3> -> 
tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:44.9916892Z %240 = ttg.convert_layout %235 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9917302Z %241 = tt.dot %238, %239, %240, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9917700Z %242 = tt.reshape %241 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9917885Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T13:28:44.9918015Z %243 = arith.muli %c4096_i32, %c2_i32_13 : i32 2026-02-21T13:28:44.9918139Z %244 = arith.addi %arg5, %243 : i32 2026-02-21T13:28:44.9918280Z %245 = tt.splat %244 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9918438Z %246 = arith.addi %245, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9918681Z %247 = ttg.convert_layout %246 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:44.9919024Z %248 = tt.expand_dims %247 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:44.9919320Z %249 = ttg.convert_layout %248 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:44.9919616Z %250 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:44.9919970Z %251 = tt.expand_dims %250 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:44.9920279Z %252 = ttg.convert_layout %251 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9920502Z %253 = arith.muli %252, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9920709Z %254 = tt.broadcast %253 : 
tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9920945Z %255 = arith.addi %47, %254 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9921164Z %256 = tt.addptr %16, %255 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9921388Z %257 = tt.load %256 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9921604Z %258 = tt.reshape %257 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:44.9921922Z %259 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:44.9922283Z %260 = ttg.convert_layout %258 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:44.9922634Z %261 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9923073Z %262 = tt.dot %259, %260, %261, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9923476Z %263 = ttg.convert_layout %262 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:44.9923715Z %264 = tt.reshape %263 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9923962Z %265 = arith.truncf %264 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9924215Z %266 = arith.extf %265 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9924405Z %267 = "tt.reduce"(%266) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9924552Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9924672Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:44.9924802Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9924990Z }) : 
(tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9925279Z %268 = ttg.convert_layout %267 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9925550Z %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9925771Z %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9925964Z %271 = arith.mulf %270, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9926154Z %272 = arith.truncf %271 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9926375Z %273 = arith.extf %272 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9926569Z %274 = arith.cmpf ogt, %201, %273 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9926733Z %275 = arith.cmpf une, %201, %201 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9926894Z %276 = arith.ori %274, %275 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:44.9927085Z %277 = arith.select %276, %201, %273 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9927288Z %278 = arith.mulf %266, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9927498Z %279 = arith.truncf %278 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9927782Z %280 = ttg.convert_layout %277 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9928114Z %281 = tt.expand_dims %280 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9928406Z %282 = ttg.convert_layout %281 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9928678Z %283 = arith.extf %279 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9928925Z %284 = 
tt.broadcast %282 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:44.9929182Z %285 = ttg.convert_layout %284 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9929401Z %286 = arith.subf %283, %285 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9929728Z %287 = tt.extern_elementwise %286 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9930017Z %288 = "tt.reduce"(%287) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9930144Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9930263Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:44.9930383Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9930572Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9930878Z %289 = ttg.convert_layout %288 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9931116Z %290 = arith.subf %201, %277 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9931397Z %291 = tt.extern_elementwise %290 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9931685Z %292 = arith.mulf %217, %291 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9931841Z %293 = arith.addf %292, %289 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9932101Z %294 = ttg.convert_layout %291 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9932437Z %295 = tt.expand_dims %294 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9932735Z %296 = ttg.convert_layout %295 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9932987Z %297 = tt.broadcast 
%296 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:44.9933243Z %298 = ttg.convert_layout %297 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9933464Z %299 = arith.mulf %242, %298 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9933724Z %300 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:44.9934072Z %301 = tt.expand_dims %300 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:44.9934387Z %302 = ttg.convert_layout %301 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9934603Z %303 = arith.muli %302, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9934773Z %304 = arith.addi %49, %303 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9934980Z %305 = tt.broadcast %304 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:44.9935244Z %306 = ttg.convert_layout %305 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9935472Z %307 = arith.addi %306, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9935695Z %308 = tt.addptr %18, %307 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9935928Z %309 = tt.load %308 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9936149Z %310 = arith.truncf %287 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9936416Z %311 = tt.reshape %299 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9936657Z %312 = tt.reshape %310 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:44.9936905Z %313 = tt.reshape %309 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 
2026-02-21T13:28:44.9937214Z %314 = ttg.convert_layout %312 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:44.9937596Z %315 = ttg.convert_layout %313 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:44.9937902Z %316 = ttg.convert_layout %311 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9938318Z %317 = tt.dot %314, %315, %316, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9938718Z %318 = tt.reshape %317 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9938916Z %c3_i32 = arith.constant 3 : i32 2026-02-21T13:28:44.9939047Z %319 = arith.muli %c4096_i32, %c3_i32 : i32 2026-02-21T13:28:44.9939174Z %320 = arith.addi %arg5, %319 : i32 2026-02-21T13:28:44.9939317Z %321 = tt.splat %320 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9939477Z %322 = arith.addi %321, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9939722Z %323 = ttg.convert_layout %322 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:44.9940078Z %324 = tt.expand_dims %323 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:44.9940377Z %325 = ttg.convert_layout %324 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:44.9940678Z %326 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:44.9941026Z %327 = tt.expand_dims %326 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:44.9941345Z 
%328 = ttg.convert_layout %327 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9941570Z %329 = arith.muli %328, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9941781Z %330 = tt.broadcast %329 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9942000Z %331 = arith.addi %47, %330 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9942223Z %332 = tt.addptr %16, %331 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9942459Z %333 = tt.load %332 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9942680Z %334 = tt.reshape %333 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:44.9942983Z %335 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:44.9943349Z %336 = ttg.convert_layout %334 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:44.9943662Z %337 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9944082Z %338 = tt.dot %335, %336, %337, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9944488Z %339 = ttg.convert_layout %338 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:44.9944772Z %340 = tt.reshape %339 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9945022Z %341 = arith.truncf %340 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9945275Z %342 = arith.extf %341 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9945482Z %343 = 
"tt.reduce"(%342) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9945615Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9945738Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:44.9945871Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9946224Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9946529Z %344 = ttg.convert_layout %343 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9946809Z %345 = arith.truncf %344 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9947051Z %346 = arith.extf %345 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9947250Z %347 = arith.mulf %346, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9947443Z %348 = arith.truncf %347 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9947670Z %349 = arith.extf %348 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9947869Z %350 = arith.cmpf ogt, %277, %349 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9948036Z %351 = arith.cmpf une, %277, %277 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9948219Z %352 = arith.ori %350, %351 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:44.9948412Z %353 = arith.select %352, %277, %349 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9948624Z %354 = arith.mulf %342, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9948839Z %355 = arith.truncf %354 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9949130Z %356 = ttg.convert_layout %353 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9949468Z %357 = tt.expand_dims %356 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 
2026-02-21T13:28:44.9949768Z %358 = ttg.convert_layout %357 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9950025Z %359 = arith.extf %355 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9950281Z %360 = tt.broadcast %358 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:44.9950540Z %361 = ttg.convert_layout %360 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9950763Z %362 = arith.subf %359, %361 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9951071Z %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9951369Z %364 = "tt.reduce"(%363) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9951502Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9951623Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:44.9951749Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9951937Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9952235Z %365 = ttg.convert_layout %364 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9952498Z %366 = arith.subf %277, %353 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9952785Z %367 = tt.extern_elementwise %366 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9953077Z %368 = arith.mulf %293, %367 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9953237Z %369 = arith.addf %368, %365 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9953495Z %370 = ttg.convert_layout %367 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9953829Z %371 = tt.expand_dims %370 
{axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9954125Z %372 = ttg.convert_layout %371 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9954376Z %373 = tt.broadcast %372 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:44.9954632Z %374 = ttg.convert_layout %373 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9954868Z %375 = arith.mulf %318, %374 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9955124Z %376 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:44.9955475Z %377 = tt.expand_dims %376 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:44.9955792Z %378 = ttg.convert_layout %377 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9956005Z %379 = arith.muli %378, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9956195Z %380 = arith.addi %49, %379 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9956401Z %381 = tt.broadcast %380 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:44.9956662Z %382 = ttg.convert_layout %381 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9956890Z %383 = arith.addi %382, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9957117Z %384 = tt.addptr %18, %383 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9957349Z %385 = tt.load %384 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9957567Z %386 = arith.truncf %363 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9957812Z %387 = tt.reshape %375 : tensor<1x1x128xf32, #blocked1> -> 
tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9958056Z %388 = tt.reshape %386 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:44.9958309Z %389 = tt.reshape %385 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:44.9958622Z %390 = ttg.convert_layout %388 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:44.9958990Z %391 = ttg.convert_layout %389 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:44.9959294Z %392 = ttg.convert_layout %387 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9959705Z %393 = tt.dot %390, %391, %392, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9960103Z %394 = tt.reshape %393 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9960377Z scf.yield %353, %369, %394 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9960621Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:44.9960956Z %51:3 = scf.for %arg5 = %c0_i32_8 to %c8192_i32 step %c4096_i32 iter_args(%arg6 = %50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T13:28:44.9961306Z %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9961488Z %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9961729Z %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:44.9962068Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent 
= #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:44.9962364Z %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:44.9962711Z %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:44.9963087Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:44.9963397Z %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9963624Z %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9963837Z %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9964054Z %103 = arith.addi %47, %102 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9964309Z %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9964536Z %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9964758Z %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:44.9965062Z %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:44.9965425Z %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:44.9965747Z %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9966168Z %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 
2026-02-21T13:28:44.9966580Z %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:44.9966832Z %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9967082Z %113 = arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9967340Z %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9967536Z %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9967670Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9967792Z %167 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:44.9967923Z tt.reduce.return %167 : f32 2026-02-21T13:28:44.9968118Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9968423Z %116 = ttg.convert_layout %115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9968717Z %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9968938Z %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9969136Z %119 = arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9969331Z %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9969550Z %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9969767Z %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9969939Z %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9970109Z %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:44.9970305Z %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9970519Z 
%126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9970736Z %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9971041Z %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9971377Z %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9971675Z %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9971924Z %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9972190Z %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:44.9972445Z %133 = ttg.convert_layout %132 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9972667Z %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9972983Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9973273Z %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9973401Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9973523Z %167 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:44.9973644Z tt.reduce.return %167 : f32 2026-02-21T13:28:44.9973831Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9974122Z %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9974365Z %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9974650Z %139 = tt.extern_elementwise %138 
{libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9974940Z %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9975098Z %141 = arith.addf %140, %137 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9975337Z %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9975670Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9975965Z %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9976212Z %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:44.9976477Z %146 = ttg.convert_layout %145 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9976692Z %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9976943Z %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:44.9977288Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:44.9977612Z %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9977826Z %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9977992Z %152 = arith.addi %49, %151 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9978197Z %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:44.9978458Z %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 
2026-02-21T13:28:44.9978696Z %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9978917Z %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:44.9979146Z %157 = tt.load %156 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9979359Z %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9979600Z %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9979835Z %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:44.9980093Z %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:44.9980399Z %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:44.9980762Z %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:44.9981062Z %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9981469Z %165 = tt.dot %162, %163, %164, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:44.9981870Z %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9982139Z scf.yield %125, %141, %166 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9982358Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:44.9982580Z %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 
2026-02-21T13:28:44.9982910Z %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9983203Z %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9983441Z %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:44.9983683Z %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9983891Z %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:44.9984092Z %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1> 2026-02-21T13:28:44.9984342Z %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9984571Z tt.store %59, %58 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9984718Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T13:28:44.9984841Z %60 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T13:28:44.9984962Z %61 = arith.addi %arg4, %60 : i32 2026-02-21T13:28:44.9985079Z %62 = arith.divsi %61, %c131072_i32 : i32 2026-02-21T13:28:44.9985217Z %63 = arith.muli %62, %c16_i32 : i32 2026-02-21T13:28:44.9985334Z %64 = arith.subi %c192_i32, %63 : i32 2026-02-21T13:28:44.9985446Z %65 = arith.minsi %64, %c16_i32 : i32 2026-02-21T13:28:44.9985566Z %66 = arith.remsi %61, %c131072_i32 : i32 2026-02-21T13:28:44.9985682Z %67 = arith.remsi %66, %65 : i32 2026-02-21T13:28:44.9985793Z %68 = arith.addi %63, %67 : i32 2026-02-21T13:28:44.9985901Z %69 = arith.divsi %66, %65 : i32 2026-02-21T13:28:44.9986018Z %70 = arith.muli %68, %c1048576_i32 : i32 2026-02-21T13:28:44.9986137Z %71 = arith.muli %69, %c128_i32 : i32 2026-02-21T13:28:44.9986249Z %72 = arith.addi %70, %71 : i32 2026-02-21T13:28:44.9986397Z %73 = tt.splat %72 : i32 -> tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9986552Z %74 = 
arith.addi %73, %10 : tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9986758Z %75 = tt.addptr %11, %74 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:44.9986967Z %76 = tt.load %75 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9987125Z %77 = tt.splat %70 : i32 -> tensor<1x128x1xi32, #blocked> 2026-02-21T13:28:44.9987277Z %78 = arith.addi %77, %15 : tensor<1x128x1xi32, #blocked> 2026-02-21T13:28:44.9987473Z %79 = tt.broadcast %78 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x4096xi32, #blocked> 2026-02-21T13:28:44.9987741Z %80 = ttg.convert_layout %79 : tensor<1x128x4096xi32, #blocked> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9987987Z %81 = tt.reshape %76 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3> 2026-02-21T13:28:44.9988178Z %82 = tt.splat %70 : i32 -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:44.9988315Z %c0_i32_10 = arith.constant 0 : i32 2026-02-21T13:28:44.9988439Z %c16384_i32_11 = arith.constant 16384 : i32 2026-02-21T13:28:44.9988781Z %83:3 = scf.for %arg5 = %c0_i32 to %c0_i32_10 step %c16384_i32_11 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T13:28:44.9989135Z %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9989293Z %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:44.9989529Z %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:44.9989861Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:44.9990157Z %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:44.9990446Z %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> 
tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:44.9990791Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:44.9991098Z %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9991316Z %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:44.9991532Z %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9991741Z %103 = arith.addi %80, %102 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9991986Z %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:44.9992211Z %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:44.9992425Z %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:44.9992725Z %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:44.9993097Z %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:44.9993413Z %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9993823Z %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:44.9994239Z %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:44.9994483Z %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9994725Z %113 
= arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9994975Z %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9995169Z %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({ 2026-02-21T13:28:44.9995294Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:44.9995417Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:44.9995555Z tt.reduce.return %395 : f32 2026-02-21T13:28:44.9995743Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:44.9996036Z %116 = ttg.convert_layout %115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9996308Z %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9996528Z %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9996719Z %119 = arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9996909Z %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:44.9997125Z %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9997321Z %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9997494Z %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9997661Z %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:44.9997854Z %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:44.9998057Z %126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9998267Z %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:44.9998554Z %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> 
-> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:44.9998884Z %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:44.9999179Z %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:44.9999425Z %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:44.9999690Z %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:44.9999951Z %133 = ttg.convert_layout %132 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0000169Z %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0000474Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0000776Z %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0000905Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0001023Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0001140Z tt.reduce.return %395 : f32 2026-02-21T13:28:45.0001328Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0001620Z %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0001884Z %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0002169Z %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0002456Z %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0002655Z %141 = arith.addf %140, %137 : 
tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0002891Z %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0003243Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0003539Z %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0003784Z %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0004038Z %146 = ttg.convert_layout %145 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0004251Z %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0004504Z %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0004850Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0005162Z %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0005377Z %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0005541Z %152 = arith.addi %82, %151 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0005746Z %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0006008Z %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0006228Z %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0006456Z %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0006680Z %157 = tt.load %156 : 
tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0006896Z %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0007144Z %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0007399Z %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0007647Z %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0007949Z %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0008310Z %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0008629Z %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0009032Z %165 = tt.dot %162, %163, %164, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0009424Z %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0009603Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T13:28:45.0009749Z %167 = arith.muli %c4096_i32, %c1_i32_12 : i32 2026-02-21T13:28:45.0009877Z %168 = arith.addi %arg5, %167 : i32 2026-02-21T13:28:45.0010011Z %169 = tt.splat %168 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0010173Z %170 = arith.addi %169, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0010414Z %171 = ttg.convert_layout %170 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0010752Z %172 = tt.expand_dims %171 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent 
= #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0011066Z %173 = ttg.convert_layout %172 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0011359Z %174 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0011707Z %175 = tt.expand_dims %174 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0012015Z %176 = ttg.convert_layout %175 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0012234Z %177 = arith.muli %176, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0012444Z %178 = tt.broadcast %177 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0012656Z %179 = arith.addi %80, %178 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0012877Z %180 = tt.addptr %16, %179 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0013103Z %181 = tt.load %180 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0013321Z %182 = tt.reshape %181 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0013622Z %183 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0013977Z %184 = ttg.convert_layout %182 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0014287Z %185 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0014693Z %186 = tt.dot %183, %184, %185, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 
2026-02-21T13:28:45.0015095Z %187 = ttg.convert_layout %186 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0015356Z %188 = tt.reshape %187 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0015598Z %189 = arith.truncf %188 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0015848Z %190 = arith.extf %189 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0016036Z %191 = "tt.reduce"(%190) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0016163Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0016299Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0016421Z tt.reduce.return %395 : f32 2026-02-21T13:28:45.0016610Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0016900Z %192 = ttg.convert_layout %191 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0017170Z %193 = arith.truncf %192 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0017398Z %194 = arith.extf %193 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0017603Z %195 = arith.mulf %194, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0017795Z %196 = arith.truncf %195 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0018013Z %197 = arith.extf %196 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0018208Z %198 = arith.cmpf ogt, %125, %197 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0018375Z %199 = arith.cmpf une, %125, %125 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0018538Z %200 = arith.ori %198, %199 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0018749Z %201 = arith.select %200, %125, %197 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0018950Z %202 
= arith.mulf %190, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0019160Z %203 = arith.truncf %202 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0028918Z %204 = ttg.convert_layout %201 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0029274Z %205 = tt.expand_dims %204 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0029576Z %206 = ttg.convert_layout %205 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0029829Z %207 = arith.extf %203 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0030075Z %208 = tt.broadcast %206 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0030337Z %209 = ttg.convert_layout %208 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0030561Z %210 = arith.subf %207, %209 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0030871Z %211 = tt.extern_elementwise %210 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0031166Z %212 = "tt.reduce"(%211) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0031293Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0031420Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0031545Z tt.reduce.return %395 : f32 2026-02-21T13:28:45.0031736Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0032030Z %213 = ttg.convert_layout %212 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0032274Z %214 = arith.subf %125, %201 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0032559Z %215 = tt.extern_elementwise %214 {libname = 
"", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0032896Z %216 = arith.mulf %141, %215 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0033050Z %217 = arith.addf %216, %213 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0033287Z %218 = ttg.convert_layout %215 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0033636Z %219 = tt.expand_dims %218 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0033933Z %220 = ttg.convert_layout %219 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0034183Z %221 = tt.broadcast %220 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0034433Z %222 = ttg.convert_layout %221 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0034651Z %223 = arith.mulf %166, %222 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0034918Z %224 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0035268Z %225 = tt.expand_dims %224 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0035580Z %226 = ttg.convert_layout %225 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0035792Z %227 = arith.muli %226, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0035958Z %228 = arith.addi %82, %227 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0036173Z %229 = tt.broadcast %228 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0036436Z %230 = ttg.convert_layout %229 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0036664Z 
%231 = arith.addi %230, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0036886Z %232 = tt.addptr %18, %231 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0037114Z %233 = tt.load %232 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0037325Z %234 = arith.truncf %211 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0037571Z %235 = tt.reshape %223 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0037809Z %236 = tt.reshape %234 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0038052Z %237 = tt.reshape %233 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0038357Z %238 = ttg.convert_layout %236 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0038721Z %239 = ttg.convert_layout %237 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0039025Z %240 = ttg.convert_layout %235 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0039436Z %241 = tt.dot %238, %239, %240, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0039831Z %242 = tt.reshape %241 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0040013Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T13:28:45.0040141Z %243 = arith.muli %c4096_i32, %c2_i32_13 : i32 2026-02-21T13:28:45.0040269Z %244 = arith.addi %arg5, %243 : i32 2026-02-21T13:28:45.0040425Z %245 = tt.splat %244 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0040582Z %246 = arith.addi %245, %12 : tensor<4096xi32, #blocked4> 
2026-02-21T13:28:45.0040828Z %247 = ttg.convert_layout %246 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0041163Z %248 = tt.expand_dims %247 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0041478Z %249 = ttg.convert_layout %248 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0041774Z %250 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0042124Z %251 = tt.expand_dims %250 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0042440Z %252 = ttg.convert_layout %251 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0042707Z %253 = arith.muli %252, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0042938Z %254 = tt.broadcast %253 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0043150Z %255 = arith.addi %80, %254 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0043371Z %256 = tt.addptr %16, %255 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0043602Z %257 = tt.load %256 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0043812Z %258 = tt.reshape %257 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0044127Z %259 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0044486Z %260 = ttg.convert_layout %258 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0044797Z %261 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> 
tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0045208Z %262 = tt.dot %259, %260, %261, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0045613Z %263 = ttg.convert_layout %262 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0045852Z %264 = tt.reshape %263 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0046097Z %265 = arith.truncf %264 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0046341Z %266 = arith.extf %265 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0046538Z %267 = "tt.reduce"(%266) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0046664Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0046783Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0046911Z tt.reduce.return %395 : f32 2026-02-21T13:28:45.0047100Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0047396Z %268 = ttg.convert_layout %267 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0047671Z %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0047895Z %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0048089Z %271 = arith.mulf %270, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0048276Z %272 = arith.truncf %271 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0048513Z %273 = arith.extf %272 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0048705Z %274 = arith.cmpf ogt, %201, %273 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0048876Z %275 = arith.cmpf 
une, %201, %201 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0049040Z %276 = arith.ori %274, %275 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0049245Z %277 = arith.select %276, %201, %273 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0049449Z %278 = arith.mulf %266, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0049659Z %279 = arith.truncf %278 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0049948Z %280 = ttg.convert_layout %277 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0050281Z %281 = tt.expand_dims %280 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0050596Z %282 = ttg.convert_layout %281 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0050842Z %283 = arith.extf %279 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0051087Z %284 = tt.broadcast %282 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0051341Z %285 = ttg.convert_layout %284 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0051560Z %286 = arith.subf %283, %285 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0051881Z %287 = tt.extern_elementwise %286 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0052173Z %288 = "tt.reduce"(%287) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0052296Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0052416Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0052536Z tt.reduce.return %395 : f32 2026-02-21T13:28:45.0052724Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0053018Z 
%289 = ttg.convert_layout %288 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0053257Z %290 = arith.subf %201, %277 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0053546Z %291 = tt.extern_elementwise %290 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0053835Z %292 = arith.mulf %217, %291 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0053989Z %293 = arith.addf %292, %289 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0054228Z %294 = ttg.convert_layout %291 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0054565Z %295 = tt.expand_dims %294 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0054861Z %296 = ttg.convert_layout %295 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0055108Z %297 = tt.broadcast %296 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0055358Z %298 = ttg.convert_layout %297 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0055570Z %299 = arith.mulf %242, %298 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0055820Z %300 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0056421Z %301 = tt.expand_dims %300 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0056730Z %302 = ttg.convert_layout %301 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0056941Z %303 = arith.muli %302, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0057105Z %304 = arith.addi %82, %303 : tensor<1x4096x1xi32, #blocked> 
2026-02-21T13:28:45.0057325Z %305 = tt.broadcast %304 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0057584Z %306 = ttg.convert_layout %305 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0057806Z %307 = arith.addi %306, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0058027Z %308 = tt.addptr %18, %307 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0058258Z %309 = tt.load %308 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0058489Z %310 = arith.truncf %287 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0058736Z %311 = tt.reshape %299 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0058969Z %312 = tt.reshape %310 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0059215Z %313 = tt.reshape %309 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0059524Z %314 = ttg.convert_layout %312 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0059911Z %315 = ttg.convert_layout %313 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0060218Z %316 = ttg.convert_layout %311 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0060647Z %317 = tt.dot %314, %315, %316, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0061056Z %318 = tt.reshape %317 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0061250Z %c3_i32 = arith.constant 3 : i32 2026-02-21T13:28:45.0061375Z %319 = arith.muli %c4096_i32, 
%c3_i32 : i32 2026-02-21T13:28:45.0061509Z %320 = arith.addi %arg5, %319 : i32 2026-02-21T13:28:45.0061648Z %321 = tt.splat %320 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0061804Z %322 = arith.addi %321, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0062048Z %323 = ttg.convert_layout %322 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0062383Z %324 = tt.expand_dims %323 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0062684Z %325 = ttg.convert_layout %324 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0062987Z %326 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0063339Z %327 = tt.expand_dims %326 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0063652Z %328 = ttg.convert_layout %327 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0063870Z %329 = arith.muli %328, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0064083Z %330 = tt.broadcast %329 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0064319Z %331 = arith.addi %80, %330 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0064539Z %332 = tt.addptr %16, %331 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0064766Z %333 = tt.load %332 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0064976Z %334 = tt.reshape %333 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0065293Z %335 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 
2026-02-21T13:28:45.0065649Z %336 = ttg.convert_layout %334 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0065963Z %337 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0066379Z %338 = tt.dot %335, %336, %337, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0066797Z %339 = ttg.convert_layout %338 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0067035Z %340 = tt.reshape %339 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0067283Z %341 = arith.truncf %340 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0067529Z %342 = arith.extf %341 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0067721Z %343 = "tt.reduce"(%342) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0067844Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0067980Z %395 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0068106Z tt.reduce.return %395 : f32 2026-02-21T13:28:45.0068291Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0068585Z %344 = ttg.convert_layout %343 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0068854Z %345 = arith.truncf %344 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0069076Z %346 = arith.extf %345 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0069270Z %347 = arith.mulf %346, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0069456Z %348 = arith.truncf %347 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, 
#blocked2> 2026-02-21T13:28:45.0069672Z %349 = arith.extf %348 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0069864Z %350 = arith.cmpf ogt, %277, %349 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0070033Z %351 = arith.cmpf une, %277, %277 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0070193Z %352 = arith.ori %350, %351 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0070388Z %353 = arith.select %352, %277, %349 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0070588Z %354 = arith.mulf %342, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0070795Z %355 = arith.truncf %354 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0071080Z %356 = ttg.convert_layout %353 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0071410Z %357 = tt.expand_dims %356 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0071703Z %358 = ttg.convert_layout %357 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0071944Z %359 = arith.extf %355 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0072204Z %360 = tt.broadcast %358 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0072456Z %361 = ttg.convert_layout %360 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0072670Z %362 = arith.subf %359, %361 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0072988Z %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0073276Z %364 = "tt.reduce"(%363) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0073398Z ^bb0(%arg9: f32, %arg10: f32): 
2026-02-21T13:28:45.0073514Z %395 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0073633Z tt.reduce.return %395 : f32 2026-02-21T13:28:45.0073816Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0074118Z %365 = ttg.convert_layout %364 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0074354Z %366 = arith.subf %277, %353 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0074632Z %367 = tt.extern_elementwise %366 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0074915Z %368 = arith.mulf %293, %367 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0075067Z %369 = arith.addf %368, %365 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0075299Z %370 = ttg.convert_layout %367 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0075642Z %371 = tt.expand_dims %370 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0075933Z %372 = ttg.convert_layout %371 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0076178Z %373 = tt.broadcast %372 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0076426Z %374 = ttg.convert_layout %373 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0076636Z %375 = arith.mulf %318, %374 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0076885Z %376 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0077228Z %377 = tt.expand_dims %376 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0077534Z 
%378 = ttg.convert_layout %377 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0077745Z %379 = arith.muli %378, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0077905Z %380 = arith.addi %82, %379 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0078104Z %381 = tt.broadcast %380 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0078360Z %382 = ttg.convert_layout %381 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0078580Z %383 = arith.addi %382, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0078799Z %384 = tt.addptr %18, %383 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0079021Z %385 = tt.load %384 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0079230Z %386 = arith.truncf %363 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0079470Z %387 = tt.reshape %375 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0079721Z %388 = tt.reshape %386 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0079964Z %389 = tt.reshape %385 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0080263Z %390 = ttg.convert_layout %388 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0080637Z %391 = ttg.convert_layout %389 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0080936Z %392 = ttg.convert_layout %387 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0081342Z %393 = tt.dot %390, %391, %392, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = 
#blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0081738Z %394 = tt.reshape %393 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0082019Z scf.yield %353, %369, %394 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0082235Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:45.0082607Z %84:3 = scf.for %arg5 = %c0_i32_10 to %c8192_i32 step %c4096_i32 iter_args(%arg6 = %83#0, %arg7 = %83#1, %arg8 = %83#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T13:28:45.0082953Z %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0083110Z %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0083363Z %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0083692Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0083987Z %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0084271Z %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0084607Z %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0084912Z %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0085130Z %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0085339Z %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0085547Z %103 = arith.addi %80, %102 : tensor<1x128x4096xi32, 
#blocked1> 2026-02-21T13:28:45.0085766Z %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0085991Z %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0086201Z %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0086500Z %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0086852Z %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0087159Z %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0087566Z %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0087985Z %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0088225Z %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0088465Z %113 = arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0088733Z %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0088922Z %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0089042Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0089161Z %167 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0089282Z tt.reduce.return %167 : f32 2026-02-21T13:28:45.0089467Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0089756Z %116 = ttg.convert_layout 
%115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0090039Z %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0090255Z %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0090445Z %119 = arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0090637Z %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0090856Z %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0091067Z %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0091235Z %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0091395Z %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0091586Z %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0091787Z %126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0091993Z %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0092277Z %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0092611Z %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0092908Z %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0093149Z %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0093392Z %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0093645Z %133 = ttg.convert_layout %132 : 
tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0093862Z %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0094165Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0094451Z %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0094576Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0094690Z %167 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0094808Z tt.reduce.return %167 : f32 2026-02-21T13:28:45.0094995Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0095297Z %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0095534Z %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0095814Z %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0096113Z %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0096270Z %141 = arith.addf %140, %137 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0096505Z %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0096836Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0097130Z %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0097377Z %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0097643Z %146 = ttg.convert_layout %145 : 
tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0097852Z %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0098102Z %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0098447Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0098770Z %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0098981Z %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0099141Z %152 = arith.addi %82, %151 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0099340Z %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0099597Z %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0099816Z %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0100036Z %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0100257Z %157 = tt.load %156 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0100465Z %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0100704Z %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0100935Z %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0101179Z %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0101480Z %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = 
#blocked3}>> 2026-02-21T13:28:45.0101835Z %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0102134Z %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0102539Z %165 = tt.dot %162, %163, %164, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0102933Z %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0103211Z scf.yield %125, %141, %166 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0103426Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:45.0103645Z %85 = ttg.convert_layout %84#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0103967Z %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0104266Z %87 = ttg.convert_layout %86 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0104501Z %88 = tt.broadcast %87 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0104741Z %89 = ttg.convert_layout %88 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0104951Z %90 = arith.divf %84#2, %89 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0105144Z %91 = arith.truncf %90 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1> 2026-02-21T13:28:45.0105401Z %92 = tt.addptr %19, %74 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:45.0105607Z tt.store %92, %91 : tensor<1x1x128x!tt.ptr, #blocked1> 
2026-02-21T13:28:45.0105782Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:45.0105948Z scf.for %arg4 = %27 to %3 step %c1_i32 : i32 { 2026-02-21T13:28:45.0106078Z %29 = arith.divsi %arg4, %c131072_i32 : i32 2026-02-21T13:28:45.0106201Z %30 = arith.muli %29, %c16_i32 : i32 2026-02-21T13:28:45.0106314Z %31 = arith.subi %c192_i32, %30 : i32 2026-02-21T13:28:45.0106445Z %32 = arith.minsi %31, %c16_i32 : i32 2026-02-21T13:28:45.0106561Z %33 = arith.remsi %arg4, %c131072_i32 : i32 2026-02-21T13:28:45.0106678Z %34 = arith.remsi %33, %32 : i32 2026-02-21T13:28:45.0106784Z %35 = arith.addi %30, %34 : i32 2026-02-21T13:28:45.0106890Z %36 = arith.divsi %33, %32 : i32 2026-02-21T13:28:45.0107003Z %37 = arith.muli %35, %c1048576_i32 : i32 2026-02-21T13:28:45.0107118Z %38 = arith.muli %36, %c128_i32 : i32 2026-02-21T13:28:45.0107227Z %39 = arith.addi %37, %38 : i32 2026-02-21T13:28:45.0107353Z %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:45.0107509Z %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:45.0107707Z %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:45.0107911Z %43 = tt.load %42 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0108065Z %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked> 2026-02-21T13:28:45.0108215Z %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked> 2026-02-21T13:28:45.0108409Z %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x4096xi32, #blocked> 2026-02-21T13:28:45.0108660Z %47 = ttg.convert_layout %46 : tensor<1x128x4096xi32, #blocked> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0108905Z %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3> 2026-02-21T13:28:45.0109093Z %49 = tt.splat %37 : i32 -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0109228Z %c0_i32_8 = arith.constant 0 : i32 
2026-02-21T13:28:45.0109348Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T13:28:45.0109684Z %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c16384_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T13:28:45.0110038Z %60 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0110191Z %61 = arith.addi %60, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0110442Z %62 = ttg.convert_layout %61 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0110770Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0111058Z %64 = ttg.convert_layout %63 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0111368Z %65 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0111706Z %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0112009Z %67 = ttg.convert_layout %66 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0112221Z %68 = arith.muli %67, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0112425Z %69 = tt.broadcast %68 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0112640Z %70 = arith.addi %47, %69 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0112755Z %71 = tt.addptr %16, %70 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0112821Z %72 = tt.load %71 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0112923Z %73 = tt.reshape %72 : tensor<1x128x4096xbf16, #blocked1> -> 
tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0113074Z %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0113251Z %75 = ttg.convert_layout %73 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0113357Z %76 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0113615Z %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0113716Z %78 = ttg.convert_layout %77 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0113809Z %79 = tt.reshape %78 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0113908Z %80 = arith.truncf %79 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0114005Z %81 = arith.extf %80 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0114054Z %82 = "tt.reduce"(%81) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0114095Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0114144Z %362 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0114184Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0114298Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0114431Z %83 = ttg.convert_layout %82 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0114519Z %84 = arith.truncf %83 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0114603Z %85 = arith.extf %84 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 
2026-02-21T13:28:45.0114661Z %86 = arith.mulf %85, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0114747Z %87 = arith.truncf %86 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0114829Z %88 = arith.extf %87 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0114916Z %89 = arith.cmpf ogt, %arg6, %88 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0114983Z %90 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0115039Z %91 = arith.ori %89, %90 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0115132Z %92 = arith.select %91, %arg6, %88 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0115195Z %93 = arith.mulf %81, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0115310Z %94 = arith.truncf %93 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0115447Z %95 = ttg.convert_layout %92 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0115595Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0115696Z %97 = ttg.convert_layout %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0115793Z %98 = arith.extf %94 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0115904Z %99 = tt.broadcast %97 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0116014Z %100 = ttg.convert_layout %99 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0116077Z %101 = arith.subf %98, %100 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0116284Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, 
#blocked1> 2026-02-21T13:28:45.0116333Z %103 = "tt.reduce"(%102) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0116387Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0116432Z %362 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0116473Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0116586Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0116726Z %104 = ttg.convert_layout %103 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0116788Z %105 = arith.subf %arg6, %92 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0116974Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0117035Z %107 = arith.mulf %arg7, %106 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0117096Z %108 = arith.addf %107, %104 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0117235Z %109 = ttg.convert_layout %106 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0117384Z %110 = tt.expand_dims %109 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0117494Z %111 = ttg.convert_layout %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0117593Z %112 = tt.broadcast %111 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0117702Z %113 = ttg.convert_layout %112 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0117769Z %114 = arith.mulf %arg8, %113 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0117918Z %115 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0118078Z %116 = tt.expand_dims 
%115 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0118192Z %117 = ttg.convert_layout %116 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0118272Z %118 = arith.muli %117, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0118335Z %119 = arith.addi %49, %118 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0118439Z %120 = tt.broadcast %119 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0118553Z %121 = ttg.convert_layout %120 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0118632Z %122 = arith.addi %121, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0118751Z %123 = tt.addptr %18, %122 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0118817Z %124 = tt.load %123 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0118921Z %125 = arith.truncf %102 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0119018Z %126 = tt.reshape %114 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0119129Z %127 = tt.reshape %125 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0119232Z %128 = tt.reshape %124 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0119389Z %129 = ttg.convert_layout %127 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0119551Z %130 = ttg.convert_layout %128 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0119652Z %131 = ttg.convert_layout %126 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0119930Z %132 = tt.dot %129, %130, %131, 
inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0120024Z %133 = tt.reshape %132 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0120068Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T13:28:45.0120120Z %134 = arith.muli %c4096_i32, %c1_i32_9 : i32 2026-02-21T13:28:45.0120161Z %135 = arith.addi %arg5, %134 : i32 2026-02-21T13:28:45.0120221Z %136 = tt.splat %135 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0120281Z %137 = arith.addi %136, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0120425Z %138 = ttg.convert_layout %137 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0120575Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0120680Z %140 = ttg.convert_layout %139 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0120829Z %141 = ttg.convert_layout %140 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0120985Z %142 = tt.expand_dims %141 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0121095Z %143 = ttg.convert_layout %142 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0121163Z %144 = arith.muli %143, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0121267Z %145 = tt.broadcast %144 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0121331Z %146 = arith.addi %47, %145 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0121449Z %147 = tt.addptr %16, %146 : tensor<1x128x4096x!tt.ptr, #blocked1>, 
tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0121533Z %148 = tt.load %147 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0121638Z %149 = tt.reshape %148 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0121792Z %150 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0121953Z %151 = ttg.convert_layout %149 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0122073Z %152 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0122334Z %153 = tt.dot %150, %151, %152, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0122437Z %154 = ttg.convert_layout %153 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0122545Z %155 = tt.reshape %154 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0122692Z %156 = arith.truncf %155 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0122792Z %157 = arith.extf %156 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0122841Z %158 = "tt.reduce"(%157) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0122882Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0122930Z %362 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0122970Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0123100Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0123236Z %159 = ttg.convert_layout %158 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 
2026-02-21T13:28:45.0123326Z %160 = arith.truncf %159 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0123416Z %161 = arith.extf %160 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0123478Z %162 = arith.mulf %161, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0123567Z %163 = arith.truncf %162 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0123654Z %164 = arith.extf %163 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0123717Z %165 = arith.cmpf ogt, %92, %164 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0123779Z %166 = arith.cmpf une, %92, %92 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0123838Z %167 = arith.ori %165, %166 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0123933Z %168 = arith.select %167, %92, %164 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0123999Z %169 = arith.mulf %157, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0124103Z %170 = arith.truncf %169 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0124244Z %171 = ttg.convert_layout %168 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0124392Z %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0124497Z %173 = ttg.convert_layout %172 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0124598Z %174 = arith.extf %170 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0124700Z %175 = tt.broadcast %173 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0124830Z %176 = ttg.convert_layout %175 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0124896Z %177 = 
arith.subf %174, %176 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0125101Z %178 = tt.extern_elementwise %177 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0125150Z %179 = "tt.reduce"(%178) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0125211Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0125257Z %362 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0125297Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0125409Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0125546Z %180 = ttg.convert_layout %179 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0125607Z %181 = arith.subf %92, %168 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0125814Z %182 = tt.extern_elementwise %181 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0125874Z %183 = arith.mulf %108, %182 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0125932Z %184 = arith.addf %183, %180 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0126078Z %185 = ttg.convert_layout %182 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0126227Z %186 = tt.expand_dims %185 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0126351Z %187 = ttg.convert_layout %186 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0126454Z %188 = tt.broadcast %187 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0126564Z %189 = ttg.convert_layout %188 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0126628Z %190 = arith.mulf %133, 
%189 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0126781Z %191 = ttg.convert_layout %140 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0126939Z %192 = tt.expand_dims %191 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0127049Z %193 = ttg.convert_layout %192 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0127116Z %194 = arith.muli %193, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0127179Z %195 = arith.addi %49, %194 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0127282Z %196 = tt.broadcast %195 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0127404Z %197 = ttg.convert_layout %196 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0127470Z %198 = arith.addi %197, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0127589Z %199 = tt.addptr %18, %198 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0127658Z %200 = tt.load %199 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0127765Z %201 = arith.truncf %178 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0127859Z %202 = tt.reshape %190 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0127958Z %203 = tt.reshape %201 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0128065Z %204 = tt.reshape %200 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0128239Z %205 = ttg.convert_layout %203 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0128403Z %206 = ttg.convert_layout %204 : tensor<4096x128xbf16, #blocked3> -> 
tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0128507Z %207 = ttg.convert_layout %202 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0128781Z %208 = tt.dot %205, %206, %207, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0128876Z %209 = tt.reshape %208 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0128923Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T13:28:45.0128973Z %210 = arith.muli %c4096_i32, %c2_i32_10 : i32 2026-02-21T13:28:45.0129016Z %211 = arith.addi %arg5, %210 : i32 2026-02-21T13:28:45.0129081Z %212 = tt.splat %211 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0129153Z %213 = arith.addi %212, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0129297Z %214 = ttg.convert_layout %213 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0129452Z %215 = tt.expand_dims %214 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0129557Z %216 = ttg.convert_layout %215 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0129718Z %217 = ttg.convert_layout %216 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0129879Z %218 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0129994Z %219 = ttg.convert_layout %218 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0130060Z %220 = arith.muli %219, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0130168Z %221 = tt.broadcast %220 : 
tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0130234Z %222 = arith.addi %47, %221 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0130351Z %223 = tt.addptr %16, %222 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0130419Z %224 = tt.load %223 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0130525Z %225 = tt.reshape %224 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0130678Z %226 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0130846Z %227 = ttg.convert_layout %225 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0130953Z %228 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0131212Z %229 = tt.dot %226, %227, %228, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0131319Z %230 = ttg.convert_layout %229 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0131416Z %231 = tt.reshape %230 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0131521Z %232 = arith.truncf %231 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0131638Z %233 = arith.extf %232 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0131687Z %234 = "tt.reduce"(%233) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0131727Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0131779Z %362 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0131820Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0131946Z }) : 
(tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0132086Z %235 = ttg.convert_layout %234 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0132176Z %236 = arith.truncf %235 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0132263Z %237 = arith.extf %236 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0132328Z %238 = arith.mulf %237, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0132418Z %239 = arith.truncf %238 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0132518Z %240 = arith.extf %239 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0132584Z %241 = arith.cmpf ogt, %168, %240 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0132650Z %242 = arith.cmpf une, %168, %168 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0132710Z %243 = arith.ori %241, %242 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0132805Z %244 = arith.select %243, %168, %240 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0132874Z %245 = arith.mulf %233, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0132991Z %246 = arith.truncf %245 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0133132Z %247 = ttg.convert_layout %244 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0133285Z %248 = tt.expand_dims %247 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0133389Z %249 = ttg.convert_layout %248 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0133488Z %250 = arith.extf %246 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0133592Z %251 = 
tt.broadcast %249 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0133705Z %252 = ttg.convert_layout %251 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0133770Z %253 = arith.subf %250, %252 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0133975Z %254 = tt.extern_elementwise %253 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0134025Z %255 = "tt.reduce"(%254) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0134065Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0134113Z %362 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0134154Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0134268Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0134406Z %256 = ttg.convert_layout %255 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0134469Z %257 = arith.subf %168, %244 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0134655Z %258 = tt.extern_elementwise %257 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0134733Z %259 = arith.mulf %184, %258 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0134793Z %260 = arith.addf %259, %256 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0134932Z %261 = ttg.convert_layout %258 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0135082Z %262 = tt.expand_dims %261 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0135202Z %263 = ttg.convert_layout %262 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0135301Z %264 = tt.broadcast 
%263 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0135410Z %265 = ttg.convert_layout %264 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0135477Z %266 = arith.mulf %209, %265 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0135628Z %267 = ttg.convert_layout %216 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0135800Z %268 = tt.expand_dims %267 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0135913Z %269 = ttg.convert_layout %268 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0135977Z %270 = arith.muli %269, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0136040Z %271 = arith.addi %49, %270 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0136145Z %272 = tt.broadcast %271 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0136272Z %273 = ttg.convert_layout %272 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0136336Z %274 = arith.addi %273, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0136455Z %275 = tt.addptr %18, %274 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0136524Z %276 = tt.load %275 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0136629Z %277 = arith.truncf %254 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0136728Z %278 = tt.reshape %266 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0136827Z %279 = tt.reshape %277 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0136932Z %280 = tt.reshape %276 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 
2026-02-21T13:28:45.0137093Z %281 = ttg.convert_layout %279 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0137253Z %282 = ttg.convert_layout %280 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0137355Z %283 = ttg.convert_layout %278 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0137621Z %284 = tt.dot %281, %282, %283, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0137718Z %285 = tt.reshape %284 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0137761Z %c3_i32 = arith.constant 3 : i32 2026-02-21T13:28:45.0137813Z %286 = arith.muli %c4096_i32, %c3_i32 : i32 2026-02-21T13:28:45.0137856Z %287 = arith.addi %arg5, %286 : i32 2026-02-21T13:28:45.0137917Z %288 = tt.splat %287 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0137980Z %289 = arith.addi %288, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0138143Z %290 = ttg.convert_layout %289 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0138295Z %291 = tt.expand_dims %290 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0138402Z %292 = ttg.convert_layout %291 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0138563Z %293 = ttg.convert_layout %292 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0138719Z %294 = tt.expand_dims %293 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0138834Z 
%295 = ttg.convert_layout %294 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0138900Z %296 = arith.muli %295, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0139006Z %297 = tt.broadcast %296 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0139091Z %298 = arith.addi %47, %297 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0139210Z %299 = tt.addptr %16, %298 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0139276Z %300 = tt.load %299 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0139382Z %301 = tt.reshape %300 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0139537Z %302 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0139712Z %303 = ttg.convert_layout %301 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0139822Z %304 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0140084Z %305 = tt.dot %302, %303, %304, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0140187Z %306 = ttg.convert_layout %305 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0140286Z %307 = tt.reshape %306 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0140391Z %308 = arith.truncf %307 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0140490Z %309 = arith.extf %308 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0140541Z %310 = 
"tt.reduce"(%309) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0140582Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0140630Z %362 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0140670Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0140785Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0140922Z %311 = ttg.convert_layout %310 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0141014Z %312 = arith.truncf %311 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0141105Z %313 = arith.extf %312 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0141167Z %314 = arith.mulf %313, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0141260Z %315 = arith.truncf %314 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0141348Z %316 = arith.extf %315 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0141426Z %317 = arith.cmpf ogt, %244, %316 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0141490Z %318 = arith.cmpf une, %244, %244 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0141550Z %319 = arith.ori %317, %318 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0141646Z %320 = arith.select %319, %244, %316 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0141711Z %321 = arith.mulf %309, %cst_0 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0141832Z %322 = arith.truncf %321 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0141972Z %323 = ttg.convert_layout %320 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0142122Z %324 = tt.expand_dims %323 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 
2026-02-21T13:28:45.0142230Z %325 = ttg.convert_layout %324 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0142343Z %326 = arith.extf %322 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0142445Z %327 = tt.broadcast %325 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0142561Z %328 = ttg.convert_layout %327 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0142625Z %329 = arith.subf %326, %328 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0142827Z %330 = tt.extern_elementwise %329 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0142878Z %331 = "tt.reduce"(%330) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0142930Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0142976Z %362 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0143017Z tt.reduce.return %362 : f32 2026-02-21T13:28:45.0143131Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0143268Z %332 = ttg.convert_layout %331 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0143328Z %333 = arith.subf %244, %320 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0143521Z %334 = tt.extern_elementwise %333 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0143580Z %335 = arith.mulf %260, %334 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0143638Z %336 = arith.addf %335, %332 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0143779Z %337 = ttg.convert_layout %334 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0143929Z %338 = tt.expand_dims %337 
{axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0144034Z %339 = ttg.convert_layout %338 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0144136Z %340 = tt.broadcast %339 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0144244Z %341 = ttg.convert_layout %340 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0144309Z %342 = arith.mulf %285, %341 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0144460Z %343 = ttg.convert_layout %292 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0144617Z %344 = tt.expand_dims %343 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0144743Z %345 = ttg.convert_layout %344 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0144810Z %346 = arith.muli %345, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0144871Z %347 = arith.addi %49, %346 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0144973Z %348 = tt.broadcast %347 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0145091Z %349 = ttg.convert_layout %348 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0145171Z %350 = arith.addi %349, %17 : tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0145289Z %351 = tt.addptr %18, %350 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0145358Z %352 = tt.load %351 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0145463Z %353 = arith.truncf %330 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0145560Z %354 = tt.reshape %342 : tensor<1x1x128xf32, #blocked1> -> 
tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0145672Z %355 = tt.reshape %353 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0145776Z %356 = tt.reshape %352 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0145932Z %357 = ttg.convert_layout %355 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0146097Z %358 = ttg.convert_layout %356 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0146214Z %359 = ttg.convert_layout %354 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0146475Z %360 = tt.dot %357, %358, %359, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0146575Z %361 = tt.reshape %360 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0146708Z scf.yield %320, %336, %361 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0146754Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:45.0147006Z %51:3 = scf.for %arg5 = %c0_i32_8 to %c8192_i32 step %c4096_i32 iter_args(%arg6 = %50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>) : i32 { 2026-02-21T13:28:45.0147067Z %60 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0147128Z %61 = arith.addi %60, %12 : tensor<4096xi32, #blocked4> 2026-02-21T13:28:45.0147273Z %62 = ttg.convert_layout %61 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T13:28:45.0147425Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent 
= #blocked5}>> -> tensor<1x4096xi32, #blocked5> 2026-02-21T13:28:45.0147527Z %64 = ttg.convert_layout %63 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3> 2026-02-21T13:28:45.0147673Z %65 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> 2026-02-21T13:28:45.0147828Z %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6> 2026-02-21T13:28:45.0147937Z %67 = ttg.convert_layout %66 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0148006Z %68 = arith.muli %67, %cst_3 : tensor<1x1x4096xi32, #blocked1> 2026-02-21T13:28:45.0148107Z %69 = tt.broadcast %68 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0148184Z %70 = arith.addi %47, %69 : tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0148303Z %71 = tt.addptr %16, %70 : tensor<1x128x4096x!tt.ptr, #blocked1>, tensor<1x128x4096xi32, #blocked1> 2026-02-21T13:28:45.0148368Z %72 = tt.load %71 : tensor<1x128x4096x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0148469Z %73 = tt.reshape %72 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3> 2026-02-21T13:28:45.0148634Z %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> 2026-02-21T13:28:45.0148794Z %75 = ttg.convert_layout %73 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> 2026-02-21T13:28:45.0148900Z %76 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0149173Z %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8> 2026-02-21T13:28:45.0149276Z %78 = 
ttg.convert_layout %77 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3> 2026-02-21T13:28:45.0149371Z %79 = tt.reshape %78 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0149474Z %80 = arith.truncf %79 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0149572Z %81 = arith.extf %80 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0149624Z %82 = "tt.reduce"(%81) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0149678Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0149729Z %134 = arith.maxnumf %arg9, %arg10 : f32 2026-02-21T13:28:45.0149770Z tt.reduce.return %134 : f32 2026-02-21T13:28:45.0149883Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0150020Z %83 = ttg.convert_layout %82 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0150108Z %84 = arith.truncf %83 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0150191Z %85 = arith.extf %84 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0150256Z %86 = arith.mulf %85, %cst_1 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0150340Z %87 = arith.truncf %86 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2> 2026-02-21T13:28:45.0150423Z %88 = arith.extf %87 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0150491Z %89 = arith.cmpf ogt, %arg6, %88 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0150558Z %90 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0150616Z %91 = arith.ori %89, %90 : tensor<1x1xi1, #blocked2> 2026-02-21T13:28:45.0150714Z %92 = arith.select %91, %arg6, %88 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0150777Z %93 = arith.mulf %81, %cst_0 : tensor<1x1x4096xf32, #blocked1> 
2026-02-21T13:28:45.0150877Z %94 = arith.truncf %93 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0151017Z %95 = ttg.convert_layout %92 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0151165Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0151266Z %97 = ttg.convert_layout %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0151363Z %98 = arith.extf %94 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0151486Z %99 = tt.broadcast %97 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10> 2026-02-21T13:28:45.0151598Z %100 = ttg.convert_layout %99 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0151661Z %101 = arith.subf %98, %100 : tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0151866Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1> 2026-02-21T13:28:45.0151934Z %103 = "tt.reduce"(%102) <{axis = 2 : i32}> ({ 2026-02-21T13:28:45.0151974Z ^bb0(%arg9: f32, %arg10: f32): 2026-02-21T13:28:45.0152021Z %134 = arith.addf %arg9, %arg10 : f32 2026-02-21T13:28:45.0152063Z tt.reduce.return %134 : f32 2026-02-21T13:28:45.0152174Z }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> 2026-02-21T13:28:45.0152330Z %104 = ttg.convert_layout %103 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0152392Z %105 = arith.subf %arg6, %92 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0152577Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : 
(tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0152643Z %107 = arith.mulf %arg7, %106 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0152701Z %108 = arith.addf %107, %104 : tensor<1x1xf32, #blocked2> 2026-02-21T13:28:45.0152840Z %109 = ttg.convert_layout %106 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0153005Z %110 = tt.expand_dims %109 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0153111Z %111 = ttg.convert_layout %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0153211Z %112 = tt.broadcast %111 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0153325Z %113 = ttg.convert_layout %112 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0153390Z %114 = arith.mulf %arg8, %113 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0153539Z %115 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> 2026-02-21T13:28:45.0153700Z %116 = tt.expand_dims %115 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7> 2026-02-21T13:28:45.0153812Z %117 = ttg.convert_layout %116 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0153876Z %118 = arith.muli %117, %cst : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0153942Z %119 = arith.addi %49, %118 : tensor<1x4096x1xi32, #blocked> 2026-02-21T13:28:45.0154048Z %120 = tt.broadcast %119 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked> 2026-02-21T13:28:45.0154163Z %121 = ttg.convert_layout %120 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0154230Z %122 = arith.addi %121, %17 : tensor<1x4096x128xi32, 
#blocked1> 2026-02-21T13:28:45.0154349Z %123 = tt.addptr %18, %122 : tensor<1x4096x128x!tt.ptr, #blocked1>, tensor<1x4096x128xi32, #blocked1> 2026-02-21T13:28:45.0154416Z %124 = tt.load %123 : tensor<1x4096x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0154522Z %125 = arith.truncf %102 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1> 2026-02-21T13:28:45.0154617Z %126 = tt.reshape %114 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0154736Z %127 = tt.reshape %125 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3> 2026-02-21T13:28:45.0154840Z %128 = tt.reshape %124 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3> 2026-02-21T13:28:45.0154999Z %129 = ttg.convert_layout %127 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> 2026-02-21T13:28:45.0155172Z %130 = ttg.convert_layout %128 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> 2026-02-21T13:28:45.0155274Z %131 = ttg.convert_layout %126 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0155537Z %132 = tt.dot %129, %130, %131, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3> 2026-02-21T13:28:45.0155633Z %133 = tt.reshape %132 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0155780Z scf.yield %92, %108, %133 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0155829Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:45.0155969Z %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> 2026-02-21T13:28:45.0156118Z %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, 
#ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9> 2026-02-21T13:28:45.0156220Z %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10> 2026-02-21T13:28:45.0156329Z %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10> 2026-02-21T13:28:45.0156436Z %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0156501Z %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1> 2026-02-21T13:28:45.0156602Z %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1> 2026-02-21T13:28:45.0156707Z %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr, #blocked1>, tensor<1x1x128xi32, #blocked1> 2026-02-21T13:28:45.0156773Z tt.store %59, %58 : tensor<1x1x128x!tt.ptr, #blocked1> 2026-02-21T13:28:45.0156850Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T13:28:45.0156885Z tt.return 2026-02-21T13:28:45.0156918Z } 2026-02-21T13:28:45.0156951Z } 2026-02-21T13:28:45.0156954Z 2026-02-21T13:28:45.0156984Z {-# 2026-02-21T13:28:45.0157029Z external_resources: { 2026-02-21T13:28:45.0157066Z mlir_reproducer: { 2026-02-21T13:28:45.0159197Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 
use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T13:28:45.0159259Z disable_threading: false, 2026-02-21T13:28:45.0159296Z verify_each: true 2026-02-21T13:28:45.0159329Z } 2026-02-21T13:28:45.0159359Z } 2026-02-21T13:28:45.0159387Z #-} 2026-02-21T13:28:45.0159623Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T13:28:45.0160084Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T13:28:45.0160197Z [295s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T13:28:45.0160844Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T13:28:45.0160901Z Error: RuntimeError: PassManager::run failed 2026-02-21T13:28:45.0160982Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T13:30:17.7946870Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T13:30:17.7956089Z [388s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s]) 2026-02-21T13:30:17.9633019Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 - configs/s 2026-02-21T13:30:18.9425916Z [389s] Initial random population of 100, 5 starting points: 2026-02-21T13:30:18.9426349Z error=19 2026-02-21T13:30:18.9426559Z timeout=23 2026-02-21T13:30:18.9429403Z ok=58 2026-02-21T13:30:18.9429683Z min=30.2500 2026-02-21T13:30:18.9429930Z mid=215.0913 2026-02-21T13:30:18.9430151Z max=11076.6172 2026-02-21T13:30:18.9430410Z best={'block_sizes': [1, 128, 32], 2026-02-21T13:30:18.9430832Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:30:18.9431253Z 'l2_groupings': [1], 2026-02-21T13:30:18.9431525Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:30:18.9431836Z 'loop_orders': [[1, 0]], 2026-02-21T13:30:18.9432115Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:30:18.9432371Z 'num_sm_multiplier': 1, 2026-02-21T13:30:18.9432610Z 'num_stages': 2, 2026-02-21T13:30:18.9432822Z 'num_warps': 4, 2026-02-21T13:30:18.9433076Z 'pid_type': 'persistent_interleaved', 2026-02-21T13:30:18.9433381Z 'range_flattens': [False, 
False], 2026-02-21T13:30:18.9433676Z 'range_multi_buffers': [True, None], 2026-02-21T13:30:18.9433959Z 'range_num_stages': [3, 2], 2026-02-21T13:30:18.9434217Z 'range_unroll_factors': [3, 2], 2026-02-21T13:30:18.9434498Z 'range_warp_specializes': [], 2026-02-21T13:30:18.9434751Z 'waves_per_eu': 2} 2026-02-21T13:30:18.9438327Z [389s] Fitting surrogate: 100 points, 100 targets 2026-02-21T13:30:19.8744244Z [390s] Generation 1 starting: 97 neighbors, 5 active search path(s) 2026-02-21T13:30:55.1616011Z [426s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:31:00.1148647Z [430s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:31:00.1168262Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.5 configs/s 2026-02-21T13:31:48.7014165Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 97/97 2.2 configs/s 2026-02-21T13:31:49.7143689Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 8/8 - configs/s 2026-02-21T13:31:57.7471634Z [488s] Generation 1 complete: 2026-02-21T13:31:57.7471977Z error=7 2026-02-21T13:31:57.7472169Z timeout=2 2026-02-21T13:31:57.7472327Z ok=93 
2026-02-21T13:31:57.7472474Z min=22.4960 2026-02-21T13:31:57.7472670Z mid=34.5427 2026-02-21T13:31:57.7472816Z max=387.1271 2026-02-21T13:31:57.7473000Z best={'block_sizes': [1, 128, 32], 2026-02-21T13:31:57.7473315Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:31:57.7474018Z 'l2_groupings': [8], 2026-02-21T13:31:57.7474243Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:31:57.7474476Z 'loop_orders': [[0, 1]], 2026-02-21T13:31:57.7474683Z 'matrix_instr_nonkdim': 16, 2026-02-21T13:31:57.7474891Z 'num_sm_multiplier': 64, 2026-02-21T13:31:57.7475092Z 'num_stages': 2, 2026-02-21T13:31:57.7475257Z 'num_warps': 4, 2026-02-21T13:31:57.7475447Z 'pid_type': 'persistent_blocked', 2026-02-21T13:31:57.7475677Z 'range_flattens': [False, None], 2026-02-21T13:31:57.7475916Z 'range_multi_buffers': [True, False], 2026-02-21T13:31:57.7476148Z 'range_num_stages': [1, 1], 2026-02-21T13:31:57.7476484Z 'range_unroll_factors': [3, 0], 2026-02-21T13:31:57.7476703Z 'range_warp_specializes': [], 2026-02-21T13:31:57.7476904Z 'waves_per_eu': 2} 2026-02-21T13:31:57.7488012Z [488s] Fitting surrogate: 202 points, 202 targets 2026-02-21T13:31:58.6968296Z [489s] Generation 2 starting: 94 neighbors, 5 active search path(s) 2026-02-21T13:32:32.4735275Z [523s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:32:42.2950616Z [533s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], 
matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[1, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:32:45.8884961Z [536s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:32:47.2913170Z [538s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:32:47.2935914Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.6 configs/s 2026-02-21T13:34:01.0895446Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 97/97 1.9 configs/s 2026-02-21T13:34:02.2113841Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T13:34:12.1397203Z [623s] Generation 2 complete: 2026-02-21T13:34:12.1397557Z error=3 2026-02-21T13:34:12.1397791Z timeout=4 2026-02-21T13:34:12.1397978Z ok=93 2026-02-21T13:34:12.1398162Z min=20.9861 2026-02-21T13:34:12.1398357Z mid=40.1549 2026-02-21T13:34:12.1398538Z max=846.0421 2026-02-21T13:34:12.1398767Z best={'block_sizes': [1, 128, 32], 2026-02-21T13:34:12.1399158Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 
2026-02-21T13:34:12.1399541Z 'l2_groupings': [8], 2026-02-21T13:34:12.1399815Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:34:12.1400125Z 'loop_orders': [[0, 1]], 2026-02-21T13:34:12.1400403Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:34:12.1400660Z 'num_sm_multiplier': 64, 2026-02-21T13:34:12.1400898Z 'num_stages': 2, 2026-02-21T13:34:12.1401110Z 'num_warps': 4, 2026-02-21T13:34:12.1401689Z 'pid_type': 'persistent_blocked', 2026-02-21T13:34:12.1401990Z 'range_flattens': [None, None], 2026-02-21T13:34:12.1402246Z 'range_multi_buffers': [True, False], 2026-02-21T13:34:12.1402504Z 'range_num_stages': [1, 1], 2026-02-21T13:34:12.1402848Z 'range_unroll_factors': [3, 0], 2026-02-21T13:34:12.1403099Z 'range_warp_specializes': [], 2026-02-21T13:34:12.1403333Z 'waves_per_eu': 2} 2026-02-21T13:34:12.1415519Z [623s] Fitting surrogate: 302 points, 302 targets 2026-02-21T13:34:13.0621156Z [623s] Generation 3 starting: 90 neighbors, 5 active search path(s) 2026-02-21T13:34:52.4150616Z [663s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:34:53.0860030Z [663s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:34:54.3226095Z 
[665s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:34:59.8661242Z [670s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:35:00.7833549Z [671s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:35:01.0361915Z [671s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:35:01.3144519Z [672s] Timeout after 30s 
compiling Config(block_sizes=[1, 256, 512], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:35:01.5669841Z [672s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:35:01.8162959Z [672s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:35:02.2360763Z [673s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:35:02.2375291Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 0.6 configs/s 2026-02-21T13:35:52.0009222Z Generation 3: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━━━ 92/92 2.3 configs/s 2026-02-21T13:35:52.8628796Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s 2026-02-21T13:36:00.4781909Z [731s] Generation 3 complete: 2026-02-21T13:36:00.4782299Z error=5 2026-02-21T13:36:00.4782557Z timeout=10 2026-02-21T13:36:00.4782765Z ok=80 2026-02-21T13:36:00.4782996Z min=20.6766 2026-02-21T13:36:00.4783198Z mid=36.8367 2026-02-21T13:36:00.4783401Z max=509.9584 2026-02-21T13:36:00.4783640Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:36:00.4784074Z 'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'], 2026-02-21T13:36:00.4784480Z 'l2_groupings': [4], 2026-02-21T13:36:00.4784757Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:36:00.4785073Z 'loop_orders': [[0, 1]], 2026-02-21T13:36:00.4785354Z 'matrix_instr_nonkdim': 16, 2026-02-21T13:36:00.4785994Z 'num_stages': 1, 2026-02-21T13:36:00.4786228Z 'num_warps': 4, 2026-02-21T13:36:00.4786477Z 'pid_type': 'flat', 2026-02-21T13:36:00.4786735Z 'range_flattens': [None, False], 2026-02-21T13:36:00.4787041Z 'range_multi_buffers': [None, None], 2026-02-21T13:36:00.4787343Z 'range_num_stages': [0, 0], 2026-02-21T13:36:00.4787624Z 'range_unroll_factors': [0, 1], 2026-02-21T13:36:00.4787922Z 'range_warp_specializes': [], 2026-02-21T13:36:00.4788197Z 'waves_per_eu': 2} 2026-02-21T13:36:00.4803858Z [731s] Fitting surrogate: 397 points, 397 targets 2026-02-21T13:36:01.3636670Z [732s] Generation 4 starting: 90 neighbors, 5 active search path(s) 2026-02-21T13:36:39.5966180Z [770s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], 
waves_per_eu=2) 2026-02-21T13:36:41.2048889Z [772s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:36:46.8361681Z [777s] Timeout after 30s compiling Config(block_sizes=[1, 256, 512], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:36:49.5804329Z [780s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:36:49.5826211Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 0.7 configs/s 2026-02-21T13:37:56.6514439Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 93/93 1.6 configs/s 2026-02-21T13:37:57.2990840Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:38:03.6454095Z [854s] Generation 4 complete: 2026-02-21T13:38:03.6454441Z error=7 2026-02-21T13:38:03.6454555Z timeout=4 
2026-02-21T13:38:03.6454662Z ok=84 2026-02-21T13:38:03.6454771Z min=19.5046 2026-02-21T13:38:03.6454882Z mid=49.8971 2026-02-21T13:38:03.6454991Z max=860.8808 2026-02-21T13:38:03.6455124Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:38:03.6455354Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:38:03.6455587Z 'l2_groupings': [8], 2026-02-21T13:38:03.6455766Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:38:03.6455956Z 'loop_orders': [[0, 1]], 2026-02-21T13:38:03.6456104Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:38:03.6456250Z 'num_sm_multiplier': 64, 2026-02-21T13:38:03.6456391Z 'num_stages': 2, 2026-02-21T13:38:03.6456521Z 'num_warps': 4, 2026-02-21T13:38:03.6456664Z 'pid_type': 'persistent_interleaved', 2026-02-21T13:38:03.6456839Z 'range_flattens': [None, None], 2026-02-21T13:38:03.6456997Z 'range_multi_buffers': [True, None], 2026-02-21T13:38:03.6457162Z 'range_num_stages': [1, 1], 2026-02-21T13:38:03.6457629Z 'range_unroll_factors': [3, 0], 2026-02-21T13:38:03.6457786Z 'range_warp_specializes': [], 2026-02-21T13:38:03.6457932Z 'waves_per_eu': 2} 2026-02-21T13:38:03.6479913Z [854s] Fitting surrogate: 492 points, 492 targets 2026-02-21T13:38:04.5664349Z [855s] Generation 5 starting: 87 neighbors, 5 active search path(s) 2026-02-21T13:38:45.5797964Z [896s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:38:45.8866876Z [896s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], 
matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T13:38:45.8892743Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.8 configs/s 2026-02-21T13:40:03.6952776Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 90/90 1.5 configs/s 2026-02-21T13:40:04.1398990Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:40:08.4734116Z [979s] Generation 5 complete: 2026-02-21T13:40:08.4734300Z error=4 2026-02-21T13:40:08.4734433Z timeout=2 2026-02-21T13:40:08.4734546Z ok=86 2026-02-21T13:40:08.4734670Z min=19.3694 2026-02-21T13:40:08.4734794Z mid=63.8328 2026-02-21T13:40:08.4734908Z max=712.0135 2026-02-21T13:40:08.4735047Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:40:08.4735232Z 'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:40:08.4735385Z 'l2_groupings': [8], 2026-02-21T13:40:08.4735490Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:40:08.4735608Z 'loop_orders': [[0, 1]], 2026-02-21T13:40:08.4735723Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:40:08.4735878Z 'num_sm_multiplier': 64, 2026-02-21T13:40:08.4736029Z 'num_stages': 2, 2026-02-21T13:40:08.4736490Z 'num_warps': 4, 2026-02-21T13:40:08.4736620Z 'pid_type': 'persistent_interleaved', 2026-02-21T13:40:08.4736754Z 'range_flattens': [None, None], 2026-02-21T13:40:08.4736865Z 'range_multi_buffers': [True, None], 2026-02-21T13:40:08.4736977Z 'range_num_stages': [0, 1], 2026-02-21T13:40:08.4737080Z 'range_unroll_factors': [3, 0], 2026-02-21T13:40:08.4737188Z 'range_warp_specializes': [], 2026-02-21T13:40:08.4737291Z 'waves_per_eu': 2} 2026-02-21T13:40:08.4759352Z [979s] Fitting surrogate: 584 points, 584 targets 2026-02-21T13:40:09.3577892Z [980s] Generation 6 starting: 85 neighbors, 5 active 
search path(s) 2026-02-21T13:40:48.4249562Z [1019s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:40:50.3988460Z [1021s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:40:50.6973977Z [1021s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:40:56.9370432Z [1027s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[2, 1], 
range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:40:56.9393577Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 0.6 configs/s 2026-02-21T13:42:21.9709715Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 88/88 1.4 configs/s 2026-02-21T13:42:22.4823141Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:42:27.4848736Z [1118s] Generation 6 complete: 2026-02-21T13:42:27.4849069Z error=4 2026-02-21T13:42:27.4849615Z timeout=4 2026-02-21T13:42:27.4849824Z ok=82 2026-02-21T13:42:27.4850027Z min=19.4487 2026-02-21T13:42:27.4850255Z mid=67.1629 2026-02-21T13:42:27.4850452Z max=927.1636 2026-02-21T13:42:27.4850693Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:42:27.4851121Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:42:27.4851524Z 'l2_groupings': [8], 2026-02-21T13:42:27.4851793Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:42:27.4852113Z 'loop_orders': [[0, 1]], 2026-02-21T13:42:27.4852383Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:42:27.4852670Z 'num_sm_multiplier': 64, 2026-02-21T13:42:27.4852927Z 'num_stages': 2, 2026-02-21T13:42:27.4853153Z 'num_warps': 4, 2026-02-21T13:42:27.4853411Z 'pid_type': 'persistent_interleaved', 2026-02-21T13:42:27.4853727Z 'range_flattens': [None, None], 2026-02-21T13:42:27.4854028Z 'range_multi_buffers': [True, None], 2026-02-21T13:42:27.4854331Z 'range_num_stages': [0, 1], 2026-02-21T13:42:27.4854615Z 'range_unroll_factors': [3, 0], 2026-02-21T13:42:27.4854905Z 'range_warp_specializes': [], 2026-02-21T13:42:27.4855184Z 'waves_per_eu': 2} 2026-02-21T13:42:27.4876156Z [1118s] Fitting surrogate: 674 points, 674 targets 2026-02-21T13:42:28.3689740Z [1119s] Generation 7 starting: 87 neighbors, 5 active search path(s) 2026-02-21T13:43:08.1647245Z [1159s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', 
'', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:43:08.4463672Z [1159s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:43:09.6894435Z [1160s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:43:15.4250030Z [1166s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:43:16.1828219Z [1167s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], 
load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:43:16.1841661Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.7 configs/s 2026-02-21T13:44:05.6333782Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 89/89 1.7 configs/s 2026-02-21T13:44:06.3911763Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:44:13.8170851Z [1224s] Generation 7 complete: 2026-02-21T13:44:13.8171323Z error=5 2026-02-21T13:44:13.8171533Z timeout=5 2026-02-21T13:44:13.8171739Z ok=82 2026-02-21T13:44:13.8171944Z min=19.0203 2026-02-21T13:44:13.8172151Z mid=39.8274 2026-02-21T13:44:13.8172365Z max=318.4178 2026-02-21T13:44:13.8173071Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:44:13.8173494Z 'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T13:44:13.8173897Z 'l2_groupings': [8], 2026-02-21T13:44:13.8174190Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:44:13.8174529Z 'loop_orders': [[0, 1]], 2026-02-21T13:44:13.8174814Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:44:13.8175082Z 'num_stages': 1, 2026-02-21T13:44:13.8175314Z 'num_warps': 4, 2026-02-21T13:44:13.8175551Z 'pid_type': 'flat', 2026-02-21T13:44:13.8175813Z 'range_flattens': [None, False], 2026-02-21T13:44:13.8176136Z 'range_multi_buffers': [None, None], 2026-02-21T13:44:13.8176444Z 'range_num_stages': [0, 0], 2026-02-21T13:44:13.8176731Z 'range_unroll_factors': [0, 1], 2026-02-21T13:44:13.8177035Z 'range_warp_specializes': [], 2026-02-21T13:44:13.8177316Z 'waves_per_eu': 2} 2026-02-21T13:44:13.8202272Z [1224s] Fitting surrogate: 766 points, 766 targets 2026-02-21T13:44:14.7694909Z [1225s] Generation 8 starting: 94 neighbors, 5 active search 
path(s) 2026-02-21T13:44:51.7541650Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 1.0 configs/s 2026-02-21T13:45:36.6453066Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 95/95 2.2 configs/s 2026-02-21T13:45:37.4974833Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:45:45.8829631Z [1316s] Generation 8 complete: 2026-02-21T13:45:45.8830064Z error=2 2026-02-21T13:45:45.8830289Z ok=97 2026-02-21T13:45:45.8830496Z min=19.0580 2026-02-21T13:45:45.8830711Z mid=33.2071 2026-02-21T13:45:45.8830909Z max=422.5741 2026-02-21T13:45:45.8831208Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:45:45.8831633Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T13:45:45.8832493Z 'l2_groupings': [8], 2026-02-21T13:45:45.8832772Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:45:45.8833099Z 'loop_orders': [[0, 1]], 2026-02-21T13:45:45.8833391Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:45:45.8833666Z 'num_stages': 1, 2026-02-21T13:45:45.8833903Z 'num_warps': 4, 2026-02-21T13:45:45.8834134Z 'pid_type': 'flat', 2026-02-21T13:45:45.8834400Z 'range_flattens': [None, False], 2026-02-21T13:45:45.8834844Z 'range_multi_buffers': [None, None], 2026-02-21T13:45:45.8835156Z 'range_num_stages': [0, 0], 2026-02-21T13:45:45.8835439Z 'range_unroll_factors': [0, 1], 2026-02-21T13:45:45.8835744Z 'range_warp_specializes': [], 2026-02-21T13:45:45.8836021Z 'waves_per_eu': 2} 2026-02-21T13:45:45.8861012Z [1316s] Fitting surrogate: 865 points, 865 targets 2026-02-21T13:45:46.7928407Z [1317s] Generation 9 starting: 93 neighbors, 5 active search path(s) 2026-02-21T13:46:25.1383190Z [1356s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_interleaved', 
range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:46:26.8188853Z [1357s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:46:27.1426323Z [1358s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:46:27.7400835Z [1358s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:46:28.0303843Z [1358s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=16, 
pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:46:28.7871041Z [1359s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:46:28.7889206Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 1.1 configs/s 2026-02-21T13:47:20.2529252Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 94/94 2.1 configs/s 2026-02-21T13:47:20.9006361Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:47:27.2599212Z [1418s] Generation 9 complete: 2026-02-21T13:47:27.2599650Z error=6 2026-02-21T13:47:27.2599863Z timeout=6 2026-02-21T13:47:27.2600063Z ok=86 2026-02-21T13:47:27.2600273Z min=18.9024 2026-02-21T13:47:27.2600480Z mid=38.0107 2026-02-21T13:47:27.2600683Z max=423.7712 2026-02-21T13:47:27.2600946Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:47:27.2601355Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:47:27.2601742Z 'l2_groupings': [2], 2026-02-21T13:47:27.2602029Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:47:27.2602349Z 'loop_orders': [[0, 1]], 2026-02-21T13:47:27.2602734Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:47:27.2603017Z 'num_stages': 2, 2026-02-21T13:47:27.2603248Z 'num_warps': 4, 2026-02-21T13:47:27.2603487Z 'pid_type': 'flat', 2026-02-21T13:47:27.2603765Z 'range_flattens': [None, False], 2026-02-21T13:47:27.2604078Z 'range_multi_buffers': [None, None], 2026-02-21T13:47:27.2604386Z 
'range_num_stages': [0, 1], 2026-02-21T13:47:27.2605120Z 'range_unroll_factors': [0, 0], 2026-02-21T13:47:27.2605427Z 'range_warp_specializes': [], 2026-02-21T13:47:27.2605671Z 'waves_per_eu': 2} 2026-02-21T13:47:27.2633797Z [1418s] Fitting surrogate: 963 points, 963 targets 2026-02-21T13:47:29.1371522Z [1420s] Generation 10 starting: 89 neighbors, 5 active search path(s) 2026-02-21T13:48:08.4480901Z [1459s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:48:08.8543250Z [1459s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:48:10.1276413Z [1461s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:48:14.2861753Z [1465s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 
'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:48:15.2814967Z [1466s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:48:15.2842189Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.6 configs/s 2026-02-21T13:48:56.9780261Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 2.0 configs/s 2026-02-21T13:48:57.7437420Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:49:05.2365804Z [1516s] Generation 10 complete: 2026-02-21T13:49:05.2366235Z error=4 2026-02-21T13:49:05.2366440Z timeout=5 2026-02-21T13:49:05.2366648Z ok=85 2026-02-21T13:49:05.2368457Z min=19.0507 2026-02-21T13:49:05.2368720Z mid=33.1354 2026-02-21T13:49:05.2368922Z max=485.2544 2026-02-21T13:49:05.2369172Z best={'block_sizes': [1, 128, 128], 2026-02-21T13:49:05.2369620Z 'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'], 2026-02-21T13:49:05.2370031Z 'l2_groupings': [16], 2026-02-21T13:49:05.2370309Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:49:05.2370646Z 'loop_orders': [[0, 1]], 2026-02-21T13:49:05.2370929Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:49:05.2371193Z 'num_stages': 2, 2026-02-21T13:49:05.2371451Z 'num_warps': 4, 
2026-02-21T13:49:05.2371683Z 'pid_type': 'flat', 2026-02-21T13:49:05.2371942Z 'range_flattens': [None, True], 2026-02-21T13:49:05.2372247Z 'range_multi_buffers': [None, None], 2026-02-21T13:49:05.2373057Z 'range_num_stages': [0, 1], 2026-02-21T13:49:05.2373322Z 'range_unroll_factors': [0, 1], 2026-02-21T13:49:05.2373560Z 'range_warp_specializes': [], 2026-02-21T13:49:05.2373787Z 'waves_per_eu': 2} 2026-02-21T13:49:05.2405414Z [1516s] Fitting surrogate: 1057 points, 1057 targets 2026-02-21T13:49:06.2267714Z [1517s] Generation 11 starting: 88 neighbors, 5 active search path(s) 2026-02-21T13:49:47.1022254Z [1557s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:49:48.5313545Z [1559s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:49:50.7857753Z [1561s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], 
range_num_stages=[1, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:49:50.7882323Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.7 configs/s 2026-02-21T13:50:43.7562733Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 1.8 configs/s 2026-02-21T13:50:44.3550086Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:50:50.2217888Z [1621s] Generation 11 complete: 2026-02-21T13:50:50.2220896Z error=1 2026-02-21T13:50:50.2221170Z timeout=3 2026-02-21T13:50:50.2221406Z ok=89 2026-02-21T13:50:50.2221609Z min=19.0933 2026-02-21T13:50:50.2221814Z mid=36.2580 2026-02-21T13:50:50.2222019Z max=807.0919 2026-02-21T13:50:50.2222274Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:50:50.2222713Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T13:50:50.2223148Z 'l2_groupings': [8], 2026-02-21T13:50:50.2223436Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:50:50.2223748Z 'loop_orders': [[0, 1]], 2026-02-21T13:50:50.2224330Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:50:50.2224605Z 'num_stages': 1, 2026-02-21T13:50:50.2224836Z 'num_warps': 4, 2026-02-21T13:50:50.2225065Z 'pid_type': 'flat', 2026-02-21T13:50:50.2225333Z 'range_flattens': [None, True], 2026-02-21T13:50:50.2225641Z 'range_multi_buffers': [None, None], 2026-02-21T13:50:50.2225946Z 'range_num_stages': [0, 0], 2026-02-21T13:50:50.2226223Z 'range_unroll_factors': [0, 0], 2026-02-21T13:50:50.2226528Z 'range_warp_specializes': [], 2026-02-21T13:50:50.2226807Z 'waves_per_eu': 2} 2026-02-21T13:50:50.2259671Z [1621s] Fitting surrogate: 1150 points, 1150 targets 2026-02-21T13:50:50.9775262Z [1621s] Generation 12 starting: 72 neighbors, 4 active search path(s) 2026-02-21T13:51:27.5000586Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.8 configs/s 2026-02-21T13:52:09.7679305Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 1.4 
configs/s 2026-02-21T13:52:10.4277527Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:52:16.8999293Z [1707s] Generation 12 complete: 2026-02-21T13:52:16.8999722Z error=2 2026-02-21T13:52:16.8999875Z ok=74 2026-02-21T13:52:16.9000513Z min=19.1221 2026-02-21T13:52:16.9000670Z mid=31.8995 2026-02-21T13:52:16.9000822Z max=437.4912 2026-02-21T13:52:16.9001004Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:52:16.9001317Z 'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'], 2026-02-21T13:52:16.9001635Z 'l2_groupings': [8], 2026-02-21T13:52:16.9001844Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:52:16.9002084Z 'loop_orders': [[0, 1]], 2026-02-21T13:52:16.9002289Z 'matrix_instr_nonkdim': 32, 2026-02-21T13:52:16.9002488Z 'num_stages': 1, 2026-02-21T13:52:16.9002732Z 'num_warps': 4, 2026-02-21T13:52:16.9002906Z 'pid_type': 'flat', 2026-02-21T13:52:16.9003238Z 'range_flattens': [None, True], 2026-02-21T13:52:16.9003465Z 'range_multi_buffers': [None, None], 2026-02-21T13:52:16.9003691Z 'range_num_stages': [0, 0], 2026-02-21T13:52:16.9003911Z 'range_unroll_factors': [0, 0], 2026-02-21T13:52:16.9004129Z 'range_warp_specializes': [], 2026-02-21T13:52:16.9004335Z 'waves_per_eu': 2} 2026-02-21T13:52:16.9038816Z [1707s] Fitting surrogate: 1226 points, 1226 targets 2026-02-21T13:52:17.5161449Z [1708s] Generation 13 starting: 55 neighbors, 3 active search path(s) 2026-02-21T13:52:54.9981454Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 0.6 configs/s 2026-02-21T13:53:26.2005643Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 1.7 configs/s 2026-02-21T13:53:26.7422059Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:53:32.0480352Z [1782s] Generation 13 complete: 2026-02-21T13:53:32.0480667Z error=1 2026-02-21T13:53:32.0480804Z ok=58 2026-02-21T13:53:32.0480944Z min=19.1422 2026-02-21T13:53:32.0481115Z mid=33.1211 
2026-02-21T13:53:32.0481257Z max=303.2963 2026-02-21T13:53:32.0481421Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:53:32.0482052Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:53:32.0482319Z 'l2_groupings': [4], 2026-02-21T13:53:32.0482506Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:53:32.0482816Z 'loop_orders': [[0, 1]], 2026-02-21T13:53:32.0482999Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:53:32.0483184Z 'num_stages': 1, 2026-02-21T13:53:32.0483338Z 'num_warps': 4, 2026-02-21T13:53:32.0483498Z 'pid_type': 'flat', 2026-02-21T13:53:32.0483766Z 'range_flattens': [None, None], 2026-02-21T13:53:32.0483975Z 'range_multi_buffers': [None, True], 2026-02-21T13:53:32.0484186Z 'range_num_stages': [0, 1], 2026-02-21T13:53:32.0484373Z 'range_unroll_factors': [0, 1], 2026-02-21T13:53:32.0484576Z 'range_warp_specializes': [], 2026-02-21T13:53:32.0484764Z 'waves_per_eu': 2} 2026-02-21T13:53:32.0517820Z [1782s] Fitting surrogate: 1285 points, 1285 targets 2026-02-21T13:53:32.6488580Z [1783s] Generation 14 starting: 57 neighbors, 3 active search path(s) 2026-02-21T13:54:08.7621776Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 0.6 configs/s 2026-02-21T13:54:46.7511337Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 1.4 configs/s 2026-02-21T13:54:47.2402681Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:54:52.0391793Z [1862s] Generation 14 complete: 2026-02-21T13:54:52.0392142Z error=4 2026-02-21T13:54:52.0392343Z ok=56 2026-02-21T13:54:52.0392541Z min=19.1220 2026-02-21T13:54:52.0392775Z mid=30.6319 2026-02-21T13:54:52.0392978Z max=558.1230 2026-02-21T13:54:52.0393222Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:54:52.0393613Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:54:52.0393995Z 'l2_groupings': [4], 2026-02-21T13:54:52.0394266Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:54:52.0394585Z 'loop_orders': 
[[0, 1]], 2026-02-21T13:54:52.0395180Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:54:52.0395455Z 'num_stages': 2, 2026-02-21T13:54:52.0395699Z 'num_warps': 4, 2026-02-21T13:54:52.0395926Z 'pid_type': 'flat', 2026-02-21T13:54:52.0396189Z 'range_flattens': [None, None], 2026-02-21T13:54:52.0396486Z 'range_multi_buffers': [None, True], 2026-02-21T13:54:52.0396801Z 'range_num_stages': [0, 1], 2026-02-21T13:54:52.0397072Z 'range_unroll_factors': [0, 1], 2026-02-21T13:54:52.0397376Z 'range_warp_specializes': [], 2026-02-21T13:54:52.0397651Z 'waves_per_eu': 2} 2026-02-21T13:54:52.0430632Z [1862s] Fitting surrogate: 1345 points, 1345 targets 2026-02-21T13:54:52.6093480Z [1863s] Generation 15 starting: 51 neighbors, 3 active search path(s) 2026-02-21T13:55:28.0111546Z [1898s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:55:32.7062428Z [1903s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:55:32.9950429Z [1903s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, 
num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:55:33.2832304Z [1904s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:55:33.2854769Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 0.6 configs/s 2026-02-21T13:56:02.8412710Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 1.7 configs/s 2026-02-21T13:56:03.1670160Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:56:06.3474700Z [1937s] Generation 15 complete: 2026-02-21T13:56:06.3477872Z error=2 2026-02-21T13:56:06.3478436Z timeout=4 2026-02-21T13:56:06.3478671Z ok=48 2026-02-21T13:56:06.3478843Z min=19.1412 2026-02-21T13:56:06.3479019Z mid=36.5238 2026-02-21T13:56:06.3479188Z max=587.0950 2026-02-21T13:56:06.3479773Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:56:06.3480122Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:56:06.3480449Z 'l2_groupings': [4], 2026-02-21T13:56:06.3480683Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:56:06.3480949Z 'loop_orders': [[0, 1]], 2026-02-21T13:56:06.3481190Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:56:06.3481414Z 'num_stages': 2, 2026-02-21T13:56:06.3481602Z 'num_warps': 4, 2026-02-21T13:56:06.3481783Z 'pid_type': 'flat', 2026-02-21T13:56:06.3481990Z 'range_flattens': [None, None], 2026-02-21T13:56:06.3482233Z 
'range_multi_buffers': [None, False], 2026-02-21T13:56:06.3482484Z 'range_num_stages': [0, 1], 2026-02-21T13:56:06.3482935Z 'range_unroll_factors': [0, 1], 2026-02-21T13:56:06.3483169Z 'range_warp_specializes': [], 2026-02-21T13:56:06.3483396Z 'waves_per_eu': 2} 2026-02-21T13:56:06.3515034Z [1937s] Fitting surrogate: 1399 points, 1399 targets 2026-02-21T13:56:06.9069687Z [1937s] Generation 16 starting: 50 neighbors, 3 active search path(s) 2026-02-21T13:56:41.2799343Z [1972s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1) 2026-02-21T13:56:48.6075261Z [1979s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2) 2026-02-21T13:56:48.6097028Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.6 configs/s 2026-02-21T13:57:15.9899001Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 1.8 configs/s 2026-02-21T13:57:16.4288625Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:57:20.7108524Z [2011s] Generation 16 complete: 2026-02-21T13:57:20.7111110Z error=5 2026-02-21T13:57:20.7111721Z timeout=2 2026-02-21T13:57:20.7112000Z ok=46 2026-02-21T13:57:20.7112203Z min=19.0803 2026-02-21T13:57:20.7112414Z mid=33.1198 
2026-02-21T13:57:20.7112610Z max=373.0114 2026-02-21T13:57:20.7112870Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:57:20.7113312Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:57:20.7113711Z 'l2_groupings': [8], 2026-02-21T13:57:20.7113980Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:57:20.7114632Z 'loop_orders': [[0, 1]], 2026-02-21T13:57:20.7114901Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:57:20.7115166Z 'num_stages': 2, 2026-02-21T13:57:20.7115419Z 'num_warps': 4, 2026-02-21T13:57:20.7115651Z 'pid_type': 'flat', 2026-02-21T13:57:20.7115915Z 'range_flattens': [None, None], 2026-02-21T13:57:20.7116219Z 'range_multi_buffers': [None, False], 2026-02-21T13:57:20.7116531Z 'range_num_stages': [0, 1], 2026-02-21T13:57:20.7116813Z 'range_unroll_factors': [0, 1], 2026-02-21T13:57:20.7117107Z 'range_warp_specializes': [], 2026-02-21T13:57:20.7117381Z 'waves_per_eu': 2} 2026-02-21T13:57:20.7157581Z [2011s] Fitting surrogate: 1452 points, 1452 targets 2026-02-21T13:57:21.1262447Z [2012s] Generation 17 starting: 27 neighbors, 2 active search path(s) 2026-02-21T13:57:47.9311784Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 0.4 configs/s 2026-02-21T13:58:15.4921273Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 28/28 1.0 configs/s 2026-02-21T13:58:15.6866429Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:58:17.5634240Z [2068s] Generation 17 complete: 2026-02-21T13:58:17.5634896Z error=3 2026-02-21T13:58:17.5635103Z ok=26 2026-02-21T13:58:17.5635308Z min=19.0784 2026-02-21T13:58:17.5635520Z mid=39.6130 2026-02-21T13:58:17.5635724Z max=527.7296 2026-02-21T13:58:17.5635968Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:58:17.5636369Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:58:17.5636780Z 'l2_groupings': [8], 2026-02-21T13:58:17.5637074Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:58:17.5637434Z 'loop_orders': 
[[0, 1]], 2026-02-21T13:58:17.5637716Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:58:17.5637980Z 'num_stages': 2, 2026-02-21T13:58:17.5638227Z 'num_warps': 4, 2026-02-21T13:58:17.5638607Z 'pid_type': 'flat', 2026-02-21T13:58:17.5638881Z 'range_flattens': [None, None], 2026-02-21T13:58:17.5639191Z 'range_multi_buffers': [None, False], 2026-02-21T13:58:17.5639509Z 'range_num_stages': [0, 1], 2026-02-21T13:58:17.5639791Z 'range_unroll_factors': [0, 1], 2026-02-21T13:58:17.5640082Z 'range_warp_specializes': [], 2026-02-21T13:58:17.5640366Z 'waves_per_eu': 2} 2026-02-21T13:58:17.5678136Z [2068s] Fitting surrogate: 1481 points, 1481 targets 2026-02-21T13:58:17.9807255Z [2068s] Generation 18 starting: 26 neighbors, 2 active search path(s) 2026-02-21T13:58:27.1496546Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 4.7 configs/s 2026-02-21T13:58:43.3872278Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 1.6 configs/s 2026-02-21T13:58:43.5265109Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:58:44.8692404Z [2095s] Generation 18 complete: 2026-02-21T13:58:44.8692742Z ok=28 2026-02-21T13:58:44.8692947Z min=19.1416 2026-02-21T13:58:44.8693190Z mid=39.7519 2026-02-21T13:58:44.8693396Z max=293.1759 2026-02-21T13:58:44.8693645Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:58:44.8694067Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:58:44.8694457Z 'l2_groupings': [8], 2026-02-21T13:58:44.8694749Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:58:44.8695064Z 'loop_orders': [[0, 1]], 2026-02-21T13:58:44.8695338Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:58:44.8695603Z 'num_stages': 3, 2026-02-21T13:58:44.8695837Z 'num_warps': 4, 2026-02-21T13:58:44.8696063Z 'pid_type': 'flat', 2026-02-21T13:58:44.8696664Z 'range_flattens': [None, None], 2026-02-21T13:58:44.8696969Z 'range_multi_buffers': [None, False], 2026-02-21T13:58:44.8697276Z 'range_num_stages': [0, 1], 
2026-02-21T13:58:44.8697557Z 'range_unroll_factors': [0, 1], 2026-02-21T13:58:44.8697848Z 'range_warp_specializes': [], 2026-02-21T13:58:44.8698121Z 'waves_per_eu': 2} 2026-02-21T13:58:44.8740684Z [2095s] Fitting surrogate: 1509 points, 1509 targets 2026-02-21T13:58:45.2962404Z [2096s] Generation 19 starting: 29 neighbors, 2 active search path(s) 2026-02-21T13:59:03.2334327Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.8 configs/s 2026-02-21T13:59:24.9810332Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 1.4 configs/s 2026-02-21T13:59:25.2245401Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T13:59:27.4198901Z [2138s] Generation 19 complete: 2026-02-21T13:59:27.4202279Z error=2 2026-02-21T13:59:27.4202664Z ok=29 2026-02-21T13:59:27.4202934Z min=19.1087 2026-02-21T13:59:27.4203157Z mid=39.6798 2026-02-21T13:59:27.4203356Z max=495.8887 2026-02-21T13:59:27.4203618Z best={'block_sizes': [1, 128, 64], 2026-02-21T13:59:27.4204027Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T13:59:27.4204422Z 'l2_groupings': [8], 2026-02-21T13:59:27.4204702Z 'load_eviction_policies': ['', '', ''], 2026-02-21T13:59:27.4205034Z 'loop_orders': [[0, 1]], 2026-02-21T13:59:27.4205309Z 'matrix_instr_nonkdim': 0, 2026-02-21T13:59:27.4205569Z 'num_stages': 3, 2026-02-21T13:59:27.4205818Z 'num_warps': 4, 2026-02-21T13:59:27.4206044Z 'pid_type': 'flat', 2026-02-21T13:59:27.4206302Z 'range_flattens': [None, True], 2026-02-21T13:59:27.4206604Z 'range_multi_buffers': [None, False], 2026-02-21T13:59:27.4207273Z 'range_num_stages': [0, 1], 2026-02-21T13:59:27.4207498Z 'range_unroll_factors': [0, 1], 2026-02-21T13:59:27.4207742Z 'range_warp_specializes': [], 2026-02-21T13:59:27.4207967Z 'waves_per_eu': 2} 2026-02-21T13:59:27.4250477Z [2138s] Fitting surrogate: 1540 points, 1540 targets 2026-02-21T13:59:27.8306377Z [2138s] Generation 20 starting: 29 neighbors, 2 active search path(s) 
2026-02-21T13:59:59.9885210Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.3 configs/s 2026-02-21T14:00:20.0269271Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 1.5 configs/s 2026-02-21T14:00:20.2273461Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s 2026-02-21T14:00:22.1641661Z [2193s] Generation 20 complete: 2026-02-21T14:00:22.1642050Z error=3 2026-02-21T14:00:22.1642247Z ok=28 2026-02-21T14:00:22.1642460Z min=19.0642 2026-02-21T14:00:22.1642747Z mid=36.9077 2026-02-21T14:00:22.1642947Z max=419.0839 2026-02-21T14:00:22.1643209Z best={'block_sizes': [1, 128, 64], 2026-02-21T14:00:22.1643609Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T14:00:22.1643996Z 'l2_groupings': [8], 2026-02-21T14:00:22.1644270Z 'load_eviction_policies': ['', '', ''], 2026-02-21T14:00:22.1644602Z 'loop_orders': [[0, 1]], 2026-02-21T14:00:22.1644876Z 'matrix_instr_nonkdim': 0, 2026-02-21T14:00:22.1645142Z 'num_stages': 3, 2026-02-21T14:00:22.1645369Z 'num_warps': 4, 2026-02-21T14:00:22.1645606Z 'pid_type': 'flat', 2026-02-21T14:00:22.1645897Z 'range_flattens': [None, True], 2026-02-21T14:00:22.1646197Z 'range_multi_buffers': [None, False], 2026-02-21T14:00:22.1646520Z 'range_num_stages': [0, 1], 2026-02-21T14:00:22.1646791Z 'range_unroll_factors': [0, 1], 2026-02-21T14:00:22.1647088Z 'range_warp_specializes': [], 2026-02-21T14:00:22.1647369Z 'waves_per_eu': 2} 2026-02-21T14:00:22.1685573Z [2193s] Fitting surrogate: 1571 points, 1571 targets 2026-02-21T14:00:22.3076215Z [2193s] Autotuning complete in 2193.2s after searching 1445 configs. 
2026-02-21T14:00:22.3076814Z One can hardcode the best config and skip autotuning with: 2026-02-21T14:00:22.3078703Z @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T14:00:22.3080695Z 2026-02-21T14:00:22.3081092Z [2193s] Code of selected kernel: /tmp/torchinductor_root/w7/cw7ecwu3gro2xz2hdr4lasw34crvsiptntuzvcf2ojdxnm47fxts.py 2026-02-21T14:00:22.3308070Z from __future__ import annotations 2026-02-21T14:00:22.3308461Z 2026-02-21T14:00:22.3308528Z import torch 2026-02-21T14:00:22.3308677Z import triton 2026-02-21T14:00:22.3308840Z import triton.language as tl 2026-02-21T14:00:22.3309145Z from torch._inductor.runtime import triton_helpers 2026-02-21T14:00:22.3309449Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T14:00:22.3309775Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T14:00:22.3309989Z 2026-02-21T14:00:22.3310067Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T14:00:22.3310273Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T14:00:22.3310467Z _BLOCK_SIZE_3 = tl.constexpr(64) 2026-02-21T14:00:22.3310591Z 2026-02-21T14:00:22.3310653Z @triton.jit 2026-02-21T14:00:22.3310896Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr): 2026-02-21T14:00:22.3311313Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T14:00:22.3311604Z num_pid_m = 192 2026-02-21T14:00:22.3311784Z num_pid_n = tl.cdiv(8192, _BLOCK_SIZE_1) 2026-02-21T14:00:22.3312003Z inner_2d_pid = tl.program_id(0) 2026-02-21T14:00:22.3312212Z num_pid_in_group = 8 * num_pid_n 
2026-02-21T14:00:22.3312439Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T14:00:22.3312665Z first_pid_m = group_id * 8 2026-02-21T14:00:22.3312887Z group_size_m = min(num_pid_m - first_pid_m, 8) 2026-02-21T14:00:22.3313184Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T14:00:22.3313504Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T14:00:22.3313739Z offset_0 = pid_0 2026-02-21T14:00:22.3313930Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T14:00:22.3314156Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T14:00:22.3314411Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T14:00:22.3314774Z indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32) 2026-02-21T14:00:22.3315122Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T14:00:22.3315515Z m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32) 2026-02-21T14:00:22.3315838Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T14:00:22.3316136Z l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32) 2026-02-21T14:00:22.3316497Z # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32) 2026-02-21T14:00:22.3316870Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32) 2026-02-21T14:00:22.3317181Z # src[attention.py:71]: q = q_view[tile_b, tile_m, :] 2026-02-21T14:00:22.3317619Z q = tl.load(q_view + (indices_0[:, None, None] * 1048576 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T14:00:22.3318087Z # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)): 2026-02-21T14:00:22.3318395Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T14:00:22.3318673Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T14:00:22.3318912Z # src[attention.py:72-85]: ... 
2026-02-21T14:00:22.3319361Z for offset_2 in tl.range(0, 8192, _BLOCK_SIZE_3, loop_unroll_factor=1, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T14:00:22.3319862Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32) 2026-02-21T14:00:22.3320127Z q_copy = q 2026-02-21T14:00:22.3320318Z m_i_copy = m_i 2026-02-21T14:00:22.3320489Z l_i_copy = l_i 2026-02-21T14:00:22.3320654Z acc_copy = acc 2026-02-21T14:00:22.3320822Z q_copy_0 = q_copy 2026-02-21T14:00:22.3321001Z m_i_copy_0 = m_i_copy 2026-02-21T14:00:22.3321165Z l_i_copy_0 = l_i_copy 2026-02-21T14:00:22.3321301Z acc_copy_0 = acc_copy 2026-02-21T14:00:22.3321474Z # src[attention.py:73]: k = k_view[tile_b, :, tile_n] 2026-02-21T14:00:22.3321806Z k = tl.load(k_view + (indices_0[:, None, None] * 1048576 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None) 2026-02-21T14:00:22.3322148Z # src[attention.py:74]: qk = torch.bmm(q, k) 2026-02-21T14:00:22.3322865Z qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16) 2026-02-21T14:00:22.3323490Z # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale) 2026-02-21T14:00:22.3323738Z amax = tl.cast(tl.max(qk, 2), tl.bfloat16) 2026-02-21T14:00:22.3323911Z v_0 = 0.12751743074602467 2026-02-21T14:00:22.3324066Z v_1 = tl.cast(amax * v_0, tl.bfloat16) 2026-02-21T14:00:22.3324235Z v_2 = tl.cast(v_1, tl.float32) 2026-02-21T14:00:22.3324411Z v_3 = triton_helpers.maximum(m_i_copy_0, v_2) 2026-02-21T14:00:22.3324630Z # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None] 2026-02-21T14:00:22.3324828Z v_4 = 0.12751743074602467 2026-02-21T14:00:22.3324981Z v_5 = tl.cast(qk * v_4, tl.bfloat16) 2026-02-21T14:00:22.3325149Z subscript = v_3[:, :, None] 2026-02-21T14:00:22.3325299Z v_6 = tl.cast(v_5, 
tl.float32) 2026-02-21T14:00:22.3325452Z v_7 = v_6 - subscript 2026-02-21T14:00:22.3325608Z # src[attention.py:77]: p = torch.exp2(qk) 2026-02-21T14:00:22.3325781Z v_8 = libdevice.exp2(v_7) 2026-02-21T14:00:22.3325954Z # src[attention.py:78]: l_ij = torch.sum(p, -1) 2026-02-21T14:00:22.3326147Z l_ij = tl.cast(tl.sum(v_8, 2), tl.float32) 2026-02-21T14:00:22.3326345Z # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij) 2026-02-21T14:00:22.3326535Z v_9 = m_i_copy_0 - v_3 2026-02-21T14:00:22.3326685Z v_10 = libdevice.exp2(v_9) 2026-02-21T14:00:22.3326882Z # src[attention.py:80]: l_i = l_i * alpha + l_ij 2026-02-21T14:00:22.3327062Z v_11 = l_i_copy_0 * v_10 2026-02-21T14:00:22.3327207Z l_i = v_11 + l_ij 2026-02-21T14:00:22.3327370Z # src[attention.py:81]: acc = acc * alpha[:, :, None] 2026-02-21T14:00:22.3327555Z subscript_1 = v_10[:, :, None] 2026-02-21T14:00:22.3327715Z v_13 = acc_copy_0 * subscript_1 2026-02-21T14:00:22.3327896Z # src[attention.py:82]: v = v_view[tile_b, tile_n, :] 2026-02-21T14:00:22.3328226Z v = tl.load(v_view + (indices_0[:, None, None] * 1048576 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None) 2026-02-21T14:00:22.3328548Z # src[attention.py:83]: p = p.to(v.dtype) 2026-02-21T14:00:22.3328719Z v_14 = tl.cast(v_8, tl.bfloat16) 2026-02-21T14:00:22.3328909Z # src[attention.py:84]: acc = torch.baddbmm(acc, p, v) 2026-02-21T14:00:22.3329522Z acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128]) 2026-02-21T14:00:22.3330111Z # src[attention.py:85]: m_i = m_ij 2026-02-21T14:00:22.3330267Z m_i = v_3 2026-02-21T14:00:22.3330412Z # src[attention.py:87]: acc = acc / l_i[:, :, None] 2026-02-21T14:00:22.3330597Z subscript_2 = l_i[:, :, None] 2026-02-21T14:00:22.3330746Z v_15 = acc / subscript_2 
2026-02-21T14:00:22.3330940Z # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype) 2026-02-21T14:00:22.3331170Z v_16 = tl.cast(v_15, tl.bfloat16) 2026-02-21T14:00:22.3331470Z tl.store(out + (indices_0[:, None, None] * 1048576 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), v_16, None) 2026-02-21T14:00:22.3331721Z 2026-02-21T14:00:22.3331906Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T14:00:22.3332131Z """ 2026-02-21T14:00:22.3332234Z Computes scaled dot-product attention. 2026-02-21T14:00:22.3332347Z 2026-02-21T14:00:22.3332467Z Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V 2026-02-21T14:00:22.3332635Z 2026-02-21T14:00:22.3332671Z Args: 2026-02-21T14:00:22.3332805Z q_in: Query tensor of shape [..., seq_len_q, head_dim] 2026-02-21T14:00:22.3332975Z k_in: Key tensor of shape [..., seq_len_k, head_dim] 2026-02-21T14:00:22.3333143Z v_in: Value tensor of shape [..., seq_len_k, head_dim] 2026-02-21T14:00:22.3333250Z 2026-02-21T14:00:22.3333295Z Returns: 2026-02-21T14:00:22.3333461Z Output tensor of shape [..., seq_len_q, head_dim] 2026-02-21T14:00:22.3333596Z """ 2026-02-21T14:00:22.3333701Z # src[attention.py:56]: m_dim = q_in.size(-2) 2026-02-21T14:00:22.3333840Z m_dim = q_in.size(-2) 2026-02-21T14:00:22.3333959Z # src[attention.py:57]: n_dim = k_in.size(-2) 2026-02-21T14:00:22.3334095Z n_dim = k_in.size(-2) 2026-02-21T14:00:22.3334221Z # src[attention.py:58]: assert n_dim == v_in.size(-2) 2026-02-21T14:00:22.3334373Z assert n_dim == v_in.size(-2) 2026-02-21T14:00:22.3334529Z # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1)) 2026-02-21T14:00:22.3334691Z head_dim = 128 2026-02-21T14:00:22.3334831Z # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T14:00:22.3335025Z assert head_dim == k_in.size(-1) == v_in.size(-1) 2026-02-21T14:00:22.3335207Z # src[attention.py:61]: q_view = 
q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T14:00:22.3335382Z q_view = q_in.reshape([-1, m_dim, head_dim]) 2026-02-21T14:00:22.3335555Z # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T14:00:22.3335725Z v_view = v_in.reshape([-1, n_dim, head_dim]) 2026-02-21T14:00:22.3335916Z # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T14:00:22.3336155Z k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2) 2026-02-21T14:00:22.3336339Z # src[attention.py:64]: out = torch.empty_like(q_view) 2026-02-21T14:00:22.3336492Z out = torch.empty_like(q_view) 2026-02-21T14:00:22.3336666Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T14:00:22.3336845Z _BLOCK_SIZE_1 = 128 2026-02-21T14:00:22.3336946Z _RDIM_SIZE_2 = 128 2026-02-21T14:00:22.3337101Z # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]): 2026-02-21T14:00:22.3337345Z # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T14:00:22.3337574Z # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0) 2026-02-21T14:00:22.3337730Z # src[attention.py:67-88]: ... 2026-02-21T14:00:22.3338053Z _launcher(_helion_attention, (192 * triton.cdiv(8192, _BLOCK_SIZE_1),), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=3, waves_per_eu=2, matrix_instr_nonkdim=0) 2026-02-21T14:00:22.3338403Z # src[attention.py:89]: return out.view(q_in.size()) 2026-02-21T14:00:22.3338549Z return out.view(q_in.size()) 2026-02-21T14:00:23.5614666Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
2026-02-21T14:00:23.5616878Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T14:00:23.5619212Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2026-02-21T14:00:23.5619689Z WARNING:tritonbench.utils.triton_op:Completed input ID 6: 2026-02-21T14:00:23.5620113Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) 2026-02-21T14:00:23.5620460Z ------------------------------------------ 2026-02-21T14:00:23.5620864Z (4, 48, 8192, 8192, 128) 2026-02-21T14:00:23.5621046Z 2026-02-21T14:00:23.5621466Z 100%|██████████| 6/6 [1:33:29<00:00, 1273.77s/it] 2026-02-21T14:00:23.5621935Z 100%|██████████| 6/6 [1:33:29<00:00, 934.99s/it] 2026-02-21T14:00:23.5623588Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp01habu_d.csv 2026-02-21T14:00:26.5181086Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead) flex_attention-speedup flex_attention-accuracy helion_attention-speedup helion_attention-accuracy 2026-02-21T14:00:26.5182149Z ------------------------------------------ ------------------------ ------------------------- -------------------------- --------------------------- 2026-02-21T14:00:26.5182901Z (4, 48, 128, 128, 128) 2.0022 0 3.29499 0 2026-02-21T14:00:26.5183513Z (4, 48, 256, 256, 128) 2.17004 0 3.2996 0 2026-02-21T14:00:26.5184139Z (4, 48, 512, 512, 128) 2.75028 1 3.67232 0 2026-02-21T14:00:26.5184746Z (4, 48, 2048, 2048, 128) 3.51329 1 5.09774 0 2026-02-21T14:00:26.5185368Z (4, 48, 4096, 4096, 128) 3.66077 1 5.69485 0 2026-02-21T14:00:26.5185980Z (4, 48, 8192, 8192, 128) 
3.88718 1 6.07271 0 2026-02-21T14:00:26.5186984Z average 2.99729 0.666667 4.52203 0 2026-02-21T14:00:34.0719105Z ✅ Completed benchmark for kernel: flash_attention 2026-02-21T14:00:34.0726436Z [ 2026-02-21T14:00:34.0726670Z { 2026-02-21T14:00:34.0726892Z "benchmark": { 2026-02-21T14:00:34.0727291Z "name": "Helion Benchmark", 2026-02-21T14:00:34.0727600Z "extra_info": { 2026-02-21T14:00:34.0727946Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T14:00:34.0728333Z } 2026-02-21T14:00:34.0728536Z }, 2026-02-21T14:00:34.0728733Z "model": { 2026-02-21T14:00:34.0728939Z "name": "flash_attention" 2026-02-21T14:00:34.0729156Z }, 2026-02-21T14:00:34.0729313Z "metric": { 2026-02-21T14:00:34.0729522Z "name": "torch_compile_speedup", 2026-02-21T14:00:34.0729850Z "benchmark_values": [ 2026-02-21T14:00:34.0730073Z 2.0021993947171195, 2026-02-21T14:00:34.0730279Z 2.1700365612654573, 2026-02-21T14:00:34.0730517Z 2.7502829234205497, 2026-02-21T14:00:34.0730731Z 3.513288970276924, 2026-02-21T14:00:34.0731297Z 3.660765895251711, 2026-02-21T14:00:34.0731500Z 3.8871759075278574 2026-02-21T14:00:34.0731688Z ] 2026-02-21T14:00:34.0731847Z }, 2026-02-21T14:00:34.0732050Z "shape": [ 2026-02-21T14:00:34.0732237Z "(4, 48, 128, 128, 128)", 2026-02-21T14:00:34.0732444Z "(4, 48, 256, 256, 128)", 2026-02-21T14:00:34.0732650Z "(4, 48, 512, 512, 128)", 2026-02-21T14:00:34.0732894Z "(4, 48, 2048, 2048, 128)", 2026-02-21T14:00:34.0733116Z "(4, 48, 4096, 4096, 128)", 2026-02-21T14:00:34.0733444Z "(4, 48, 8192, 8192, 128)" 2026-02-21T14:00:34.0733684Z ] 2026-02-21T14:00:34.0733838Z }, 2026-02-21T14:00:34.0733987Z { 2026-02-21T14:00:34.0734146Z "benchmark": { 2026-02-21T14:00:34.0734378Z "name": "Helion Benchmark", 2026-02-21T14:00:34.0734601Z "extra_info": { 2026-02-21T14:00:34.0734861Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T14:00:34.0735182Z } 2026-02-21T14:00:34.0735329Z }, 2026-02-21T14:00:34.0735491Z "model": { 2026-02-21T14:00:34.0735668Z "name": 
"flash_attention" 2026-02-21T14:00:34.0735908Z }, 2026-02-21T14:00:34.0736136Z "metric": { 2026-02-21T14:00:34.0736486Z "name": "torch_compile_accuracy", 2026-02-21T14:00:34.0736740Z "benchmark_values": [ 2026-02-21T14:00:34.0736999Z 0.0, 2026-02-21T14:00:34.0737162Z 0.0, 2026-02-21T14:00:34.0737317Z 1.0, 2026-02-21T14:00:34.0737476Z 1.0, 2026-02-21T14:00:34.0737631Z 1.0, 2026-02-21T14:00:34.0737800Z 1.0 2026-02-21T14:00:34.0737953Z ] 2026-02-21T14:00:34.0738107Z }, 2026-02-21T14:00:34.0738289Z "shape": [ 2026-02-21T14:00:34.0738466Z "(4, 48, 128, 128, 128)", 2026-02-21T14:00:34.0738668Z "(4, 48, 256, 256, 128)", 2026-02-21T14:00:34.0738821Z "(4, 48, 512, 512, 128)", 2026-02-21T14:00:34.0738992Z "(4, 48, 2048, 2048, 128)", 2026-02-21T14:00:34.0739149Z "(4, 48, 4096, 4096, 128)", 2026-02-21T14:00:34.0739306Z "(4, 48, 8192, 8192, 128)" 2026-02-21T14:00:34.0739453Z ] 2026-02-21T14:00:34.0739562Z }, 2026-02-21T14:00:34.0739695Z { 2026-02-21T14:00:34.0739805Z "benchmark": { 2026-02-21T14:00:34.0739947Z "name": "Helion Benchmark", 2026-02-21T14:00:34.0740108Z "extra_info": { 2026-02-21T14:00:34.0740290Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T14:00:34.0740518Z } 2026-02-21T14:00:34.0740624Z }, 2026-02-21T14:00:34.0740732Z "model": { 2026-02-21T14:00:34.0740875Z "name": "flash_attention" 2026-02-21T14:00:34.0741018Z }, 2026-02-21T14:00:34.0741163Z "metric": { 2026-02-21T14:00:34.0741291Z "name": "helion_speedup", 2026-02-21T14:00:34.0741450Z "benchmark_values": [ 2026-02-21T14:00:34.0741596Z 3.2949895320648634, 2026-02-21T14:00:34.0741742Z 3.2995974823628726, 2026-02-21T14:00:34.0742014Z 3.6723164512582986, 2026-02-21T14:00:34.0742160Z 5.097737134414277, 2026-02-21T14:00:34.0742307Z 5.694853907742621, 2026-02-21T14:00:34.0742447Z 6.072705940751335 2026-02-21T14:00:34.0742640Z ] 2026-02-21T14:00:34.0742772Z }, 2026-02-21T14:00:34.0742881Z "shape": [ 2026-02-21T14:00:34.0743012Z "(4, 48, 128, 128, 128)", 2026-02-21T14:00:34.0743160Z "(4, 48, 256, 
256, 128)", 2026-02-21T14:00:34.0743306Z "(4, 48, 512, 512, 128)", 2026-02-21T14:00:34.0743477Z "(4, 48, 2048, 2048, 128)", 2026-02-21T14:00:34.0743637Z "(4, 48, 4096, 4096, 128)", 2026-02-21T14:00:34.0743795Z "(4, 48, 8192, 8192, 128)" 2026-02-21T14:00:34.0743943Z ] 2026-02-21T14:00:34.0744046Z }, 2026-02-21T14:00:34.0744151Z { 2026-02-21T14:00:34.0744257Z "benchmark": { 2026-02-21T14:00:34.0744398Z "name": "Helion Benchmark", 2026-02-21T14:00:34.0744553Z "extra_info": { 2026-02-21T14:00:34.0744727Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T14:00:34.0744926Z } 2026-02-21T14:00:34.0745033Z }, 2026-02-21T14:00:34.0745142Z "model": { 2026-02-21T14:00:34.0745305Z "name": "flash_attention" 2026-02-21T14:00:34.0745481Z }, 2026-02-21T14:00:34.0745587Z "metric": { 2026-02-21T14:00:34.0745720Z "name": "helion_accuracy", 2026-02-21T14:00:34.0745881Z "benchmark_values": [ 2026-02-21T14:00:34.0746022Z 0.0, 2026-02-21T14:00:34.0746157Z 0.0, 2026-02-21T14:00:34.0746272Z 0.0, 2026-02-21T14:00:34.0746387Z 0.0, 2026-02-21T14:00:34.0746505Z 0.0, 2026-02-21T14:00:34.0746612Z 0.0 2026-02-21T14:00:34.0746755Z ] 2026-02-21T14:00:34.0746872Z }, 2026-02-21T14:00:34.0746999Z "shape": [ 2026-02-21T14:00:34.0747129Z "(4, 48, 128, 128, 128)", 2026-02-21T14:00:34.0747338Z "(4, 48, 256, 256, 128)", 2026-02-21T14:00:34.0747487Z "(4, 48, 512, 512, 128)", 2026-02-21T14:00:34.0747635Z "(4, 48, 2048, 2048, 128)", 2026-02-21T14:00:34.0747789Z "(4, 48, 4096, 4096, 128)", 2026-02-21T14:00:34.0747940Z "(4, 48, 8192, 8192, 128)" 2026-02-21T14:00:34.0748083Z ] 2026-02-21T14:00:34.0748190Z } 2026-02-21T14:00:34.0767182Z ] 2026-02-21T14:00:34.0823565Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main 2026-02-21T14:00:34.0823771Z with: 2026-02-21T14:00:34.0824158Z github-token: *** 2026-02-21T14:00:34.0824256Z venv: .venv/bin/activate 2026-02-21T14:00:34.0824353Z schema-version: v3 2026-02-21T14:00:34.0824443Z env: 2026-02-21T14:00:34.0824528Z 
HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:34.0824661Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0824829Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:34.0824990Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0825129Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0825269Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0825414Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:34.0825572Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:34.0825726Z ##[endgroup] 2026-02-21T14:00:34.0872755Z ##[group]Run set -eux 2026-02-21T14:00:34.0872860Z set -eux 2026-02-21T14:00:34.0872943Z  2026-02-21T14:00:34.0873033Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2026-02-21T14:00:34.0873167Z  echo "Missing github-token input" 2026-02-21T14:00:34.0873279Z  exit 1 2026-02-21T14:00:34.0873362Z fi 2026-02-21T14:00:34.0873996Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T14:00:34.0874124Z env: 2026-02-21T14:00:34.0874213Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:34.0874339Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0874498Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:34.0874649Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0874783Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0874919Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.0875061Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:34.0875215Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:34.0875437Z GITHUB_TOKEN: *** 2026-02-21T14:00:34.0875527Z ##[endgroup] 2026-02-21T14:00:34.1259011Z + [[ -z *** ]] 2026-02-21T14:00:34.1338360Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 2026-02-21T14:00:34.1338529Z with: 2026-02-21T14:00:34.1338696Z 
github-token: *** 2026-02-21T14:00:34.1338793Z env: 2026-02-21T14:00:34.1338878Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:34.1339011Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1339172Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:34.1339330Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1339465Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1339605Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1339748Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:34.1339999Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:34.1340140Z ##[endgroup] 2026-02-21T14:00:34.1347291Z ##[group]Run set -eux 2026-02-21T14:00:34.1347412Z set -eux 2026-02-21T14:00:34.1347495Z  2026-02-21T14:00:34.1347677Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2026-02-21T14:00:34.1348058Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T14:00:34.1348183Z env: 2026-02-21T14:00:34.1348270Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:34.1348396Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1348549Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:34.1348703Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1348834Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1348967Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:34.1349229Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:34.1349387Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:34.1349590Z GITHUB_TOKEN: *** 2026-02-21T14:00:34.1349674Z ##[endgroup] 2026-02-21T14:00:34.2615755Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 
linux.rocm.gpu.gfx942.2-n2gvb-runner-2vdnn 2026-02-21T14:00:37.2159368Z setting job-id=64380329937 2026-02-21T14:00:37.2160125Z setting job-name=run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x 2026-02-21T14:00:37.2330944Z ##[group]Run set -eux 2026-02-21T14:00:37.2331081Z set -eux 2026-02-21T14:00:37.2331164Z  2026-02-21T14:00:37.2331266Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T14:00:37.2331398Z  source ".venv/bin/activate" 2026-02-21T14:00:37.2331511Z fi 2026-02-21T14:00:37.2331601Z  2026-02-21T14:00:37.2331759Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2026-02-21T14:00:37.2331965Z  --schema-version "${SCHEMA_VERSION}" \ 2026-02-21T14:00:37.2332096Z  --repo "${REPO}" \ 2026-02-21T14:00:37.2332212Z  --head-branch "${HEAD_BRANCH}" \ 2026-02-21T14:00:37.2332333Z  --head-sha "${HEAD_SHA}" \ 2026-02-21T14:00:37.2332456Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2026-02-21T14:00:37.2332589Z  --run-attempt "${RUN_ATTEMPT}" \ 2026-02-21T14:00:37.2332712Z  --job-id "${JOB_ID}" \ 2026-02-21T14:00:37.2332822Z  --job-name "${JOB_NAME}" 2026-02-21T14:00:37.2333023Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T14:00:37.2333232Z env: 2026-02-21T14:00:37.2333315Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:37.2333442Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.2333598Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:37.2333760Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.2333894Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.2334025Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.2334166Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:37.2334325Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:37.2334471Z SCHEMA_VERSION: v3 2026-02-21T14:00:37.2334565Z REPO: pytorch/helion 2026-02-21T14:00:37.2334669Z 
HEAD_BRANCH: refs/heads/main 2026-02-21T14:00:37.2334793Z HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T14:00:37.2334926Z WORKFLOW_RUN_ID: 22253280836 2026-02-21T14:00:37.2335032Z RUN_ATTEMPT: 1 2026-02-21T14:00:37.2335119Z JOB_ID: 64380329937 2026-02-21T14:00:37.2335329Z JOB_NAME: run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x 2026-02-21T14:00:37.2335697Z ##[endgroup] 2026-02-21T14:00:37.3177297Z + [[ -n .venv/bin/activate ]] 2026-02-21T14:00:37.3178032Z + source .venv/bin/activate 2026-02-21T14:00:37.3179358Z ++ '[' -z '' ']' 2026-02-21T14:00:37.3179609Z ++ '[' -n x ']' 2026-02-21T14:00:37.3179898Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T14:00:37.3180371Z ++ '[' .venv/bin/activate = /__w/_temp/aa47f69b-53c5-4181-bfd7-b91eff01d549.sh ']' 2026-02-21T14:00:37.3180862Z ++ deactivate nondestructive 2026-02-21T14:00:37.3181444Z ++ unset -f pydoc 2026-02-21T14:00:37.3181672Z ++ '[' -z '' ']' 2026-02-21T14:00:37.3181890Z ++ '[' -z '' ']' 2026-02-21T14:00:37.3182107Z ++ hash -r 2026-02-21T14:00:37.3182319Z ++ '[' -z '' ']' 2026-02-21T14:00:37.3182551Z ++ unset VIRTUAL_ENV 2026-02-21T14:00:37.3182814Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T14:00:37.3183125Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T14:00:37.3183484Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T14:00:37.3183849Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T14:00:37.3184164Z ++ '[' linux-gnu = msys ']' 2026-02-21T14:00:37.3184634Z ++ export VIRTUAL_ENV 2026-02-21T14:00:37.3184894Z ++ '[' -z '' ']' 2026-02-21T14:00:37.3185128Z ++ unset SCRIPT_PATH 2026-02-21T14:00:37.3186117Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T14:00:37.3187884Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T14:00:37.3188621Z ++ export PATH 2026-02-21T14:00:37.3188733Z ++ '[' xhelion '!=' x ']' 2026-02-21T14:00:37.3188874Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T14:00:37.3189014Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T14:00:37.3189146Z ++ '[' -z '' ']' 2026-02-21T14:00:37.3189259Z ++ '[' -z '' ']' 2026-02-21T14:00:37.3189371Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T14:00:37.3189497Z ++ PS1='(helion) ' 2026-02-21T14:00:37.3189609Z ++ export PS1 2026-02-21T14:00:37.3189727Z ++ alias pydoc 2026-02-21T14:00:37.3189840Z ++ true 2026-02-21T14:00:37.3189942Z ++ hash -r 2026-02-21T14:00:37.3190844Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329937 --job-name 'run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x' 2026-02-21T14:00:37.3470108Z ##[group]Run 
pytorch/test-infra/.github/actions/gather-runners-info@main 2026-02-21T14:00:37.3470291Z with: 2026-02-21T14:00:37.3470387Z venv: .venv/bin/activate 2026-02-21T14:00:37.3470486Z env: 2026-02-21T14:00:37.3470575Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:37.3470725Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3470883Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:37.3471045Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3471182Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3471324Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3471461Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:37.3471628Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:37.3471769Z ##[endgroup] 2026-02-21T14:00:37.3478783Z ##[group]Run set -eux 2026-02-21T14:00:37.3478896Z set -eux 2026-02-21T14:00:37.3478993Z  2026-02-21T14:00:37.3479079Z if command -v nvidia-smi; then 2026-02-21T14:00:37.3479197Z  DEVICE_NAME=cuda 2026-02-21T14:00:37.3479296Z  nvidia-smi 2026-02-21T14:00:37.3479397Z elif command -v rocm-smi; then 2026-02-21T14:00:37.3479603Z  DEVICE_NAME=rocm 2026-02-21T14:00:37.3479700Z  rocm-smi 2026-02-21T14:00:37.3479797Z elif command -v hl-smi; then 2026-02-21T14:00:37.3479910Z  DEVICE_NAME=hpu 2026-02-21T14:00:37.3480007Z  hl-smi 2026-02-21T14:00:37.3480085Z else 2026-02-21T14:00:37.3480171Z  arch=$(uname -m) 2026-02-21T14:00:37.3480264Z  2026-02-21T14:00:37.3480343Z  case "$arch" in 2026-02-21T14:00:37.3480490Z  aarch64|arm64) 2026-02-21T14:00:37.3480596Z  DEVICE_NAME=arm64-cpu 2026-02-21T14:00:37.3480700Z  ;; 2026-02-21T14:00:37.3480779Z  *) 2026-02-21T14:00:37.3480866Z  DEVICE_NAME=cpu 2026-02-21T14:00:37.3480965Z  ;; 2026-02-21T14:00:37.3481043Z  esac 2026-02-21T14:00:37.3481121Z  lscpu 2026-02-21T14:00:37.3481203Z fi 2026-02-21T14:00:37.3481306Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 
2026-02-21T14:00:37.3481544Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T14:00:37.3481665Z env: 2026-02-21T14:00:37.3481750Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:37.3481874Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3482025Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:37.3482177Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3482308Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3482442Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.3482632Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:37.3482787Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:37.3482930Z ##[endgroup] 2026-02-21T14:00:37.4374231Z + command -v nvidia-smi 2026-02-21T14:00:37.4374504Z + command -v rocm-smi 2026-02-21T14:00:37.4381104Z /usr/bin/rocm-smi 2026-02-21T14:00:37.4381290Z + DEVICE_NAME=rocm 2026-02-21T14:00:37.4381445Z + rocm-smi 2026-02-21T14:00:37.5990972Z 2026-02-21T14:00:37.5991025Z 2026-02-21T14:00:37.5991489Z ============================================ ROCm System Management Interface ============================================ 2026-02-21T14:00:37.5992221Z ====================================================== Concise Info ====================================================== 2026-02-21T14:00:37.5992964Z Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% 2026-02-21T14:00:37.5994043Z  (DID, GUID) (Junction) (Socket) (Mem, Compute, ID)  2026-02-21T14:00:37.5995030Z ========================================================================================================================== 2026-02-21T14:00:37.5996243Z 0 3 0x74a5, 51110 33.0°C 125.0W NPS1, SPX, 0 134Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.5997123Z 1 5 0x74a5, 2987 32.0°C 132.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.5997949Z 2 4 0x74a5, 61326 32.0°C 133.0W 
NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.5998712Z 3 2 0x74a5, 9091 48.0°C 144.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.5999270Z 4 7 0x74a5, 26567 31.0°C 129.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.5999805Z 5 9 0x74a5, 43978 36.0°C 137.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.6000321Z 6 8 0x74a5, 20463 31.0°C 129.0W NPS1, SPX, 0 134Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.6000836Z 7 6 0x74a5, 33762 34.0°C 135.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T14:00:37.6001296Z ========================================================================================================================== 2026-02-21T14:00:37.6001659Z ================================================== End of ROCm SMI Log =================================================== 2026-02-21T14:00:37.6054374Z + echo DEVICE_NAME=rocm 2026-02-21T14:00:37.6091413Z ##[group]Run set -eux 2026-02-21T14:00:37.6091548Z set -eux 2026-02-21T14:00:37.6091695Z  2026-02-21T14:00:37.6091793Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then 2026-02-21T14:00:37.6091936Z  # Return the same device name as PyTorch 2026-02-21T14:00:37.6092158Z  DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader) 2026-02-21T14:00:37.6092346Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then 2026-02-21T14:00:37.6092552Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2026-02-21T14:00:37.6092779Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then 2026-02-21T14:00:37.6093026Z  DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//') 2026-02-21T14:00:37.6093262Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then 2026-02-21T14:00:37.6093738Z  DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ 
/g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))" 2026-02-21T14:00:37.6094198Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then 2026-02-21T14:00:37.6094424Z  DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ") 2026-02-21T14:00:37.6094611Z fi 2026-02-21T14:00:37.6094716Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2026-02-21T14:00:37.6095049Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T14:00:37.6095171Z env: 2026-02-21T14:00:37.6095285Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:37.6095410Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.6095567Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:37.6095723Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.6095855Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.6095990Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.6096128Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:37.6096389Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:37.6096522Z DEVICE_NAME: rocm 2026-02-21T14:00:37.6096628Z ##[endgroup] 2026-02-21T14:00:37.6814587Z + [[ rocm == \c\u\d\a ]] 2026-02-21T14:00:37.6814795Z + [[ rocm == \r\o\c\m ]] 2026-02-21T14:00:37.6819326Z ++ rocminfo 2026-02-21T14:00:37.6832743Z ++ grep 'Marketing Name' 2026-02-21T14:00:37.6832906Z ++ tail -n1 2026-02-21T14:00:37.6833034Z ++ awk -F: '{print $2}' 2026-02-21T14:00:37.6833157Z ++ xargs 2026-02-21T14:00:37.8702643Z + DEVICE_TYPE='AMD Instinct MI325X' 2026-02-21T14:00:37.8703369Z + echo 'DEVICE_TYPE=AMD Instinct MI325X' 2026-02-21T14:00:37.8745293Z ##[group]Run set -eux 2026-02-21T14:00:37.8745507Z set -eux 2026-02-21T14:00:37.8745687Z  2026-02-21T14:00:37.8745868Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T14:00:37.8746102Z  source 
".venv/bin/activate" 2026-02-21T14:00:37.8746321Z fi 2026-02-21T14:00:37.8746485Z  2026-02-21T14:00:37.8746711Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T14:00:37.8747100Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2026-02-21T14:00:37.8747485Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T14:00:37.8747715Z env: 2026-02-21T14:00:37.8747983Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:37.8748197Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.8748421Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:37.8748625Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.8748828Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.8749015Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:37.8749207Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:37.8749491Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:37.8749681Z DEVICE_NAME: rocm 2026-02-21T14:00:37.8749928Z DEVICE_TYPE: AMD Instinct MI325X 2026-02-21T14:00:37.8750086Z ##[endgroup] 2026-02-21T14:00:37.9736568Z + [[ -n .venv/bin/activate ]] 2026-02-21T14:00:37.9737006Z + source .venv/bin/activate 2026-02-21T14:00:37.9737283Z ++ '[' -z '' ']' 2026-02-21T14:00:37.9737497Z ++ '[' -n x ']' 2026-02-21T14:00:37.9737763Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T14:00:37.9738162Z ++ '[' .venv/bin/activate = /__w/_temp/4f24e081-2409-4924-8914-94b31dabd3e3.sh ']' 2026-02-21T14:00:37.9738587Z ++ deactivate nondestructive 2026-02-21T14:00:37.9738821Z ++ unset -f pydoc 2026-02-21T14:00:37.9739028Z ++ '[' -z '' ']' 2026-02-21T14:00:37.9739249Z ++ '[' -z '' ']' 2026-02-21T14:00:37.9739432Z ++ hash -r 2026-02-21T14:00:37.9739663Z ++ '[' -z '' ']' 2026-02-21T14:00:37.9739862Z ++ unset VIRTUAL_ENV 2026-02-21T14:00:37.9740102Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T14:00:37.9740383Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T14:00:37.9740695Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T14:00:37.9741074Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T14:00:37.9741362Z ++ '[' linux-gnu = msys ']' 2026-02-21T14:00:37.9742856Z ++ export VIRTUAL_ENV 2026-02-21T14:00:37.9743113Z ++ '[' -z '' ']' 2026-02-21T14:00:37.9743402Z ++ unset SCRIPT_PATH 2026-02-21T14:00:37.9744166Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T14:00:37.9745467Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T14:00:37.9746291Z ++ export PATH 2026-02-21T14:00:37.9746517Z ++ '[' xhelion '!=' x ']' 2026-02-21T14:00:37.9746765Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T14:00:37.9747062Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T14:00:37.9747286Z ++ '[' -z '' ']' 2026-02-21T14:00:37.9747682Z ++ '[' -z '' ']' 2026-02-21T14:00:37.9747905Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T14:00:37.9748132Z ++ PS1='(helion) ' 2026-02-21T14:00:37.9748330Z ++ export PS1 2026-02-21T14:00:37.9748561Z ++ alias pydoc 2026-02-21T14:00:37.9748756Z ++ true 2026-02-21T14:00:37.9748964Z ++ hash -r 2026-02-21T14:00:37.9749264Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T14:00:38.5017821Z Collecting psutil==7.0.0 2026-02-21T14:00:38.5530433Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB) 2026-02-21T14:00:38.5680216Z Collecting nvidia-ml-py==13.580.82 2026-02-21T14:00:38.5735666Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB) 2026-02-21T14:00:38.5810555Z Downloading 
psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2026-02-21T14:00:38.6123171Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB) 2026-02-21T14:00:38.6755561Z Installing collected packages: nvidia-ml-py, psutil 2026-02-21T14:00:38.6760539Z Attempting uninstall: nvidia-ml-py 2026-02-21T14:00:38.6774759Z Found existing installation: nvidia-ml-py 13.590.48 2026-02-21T14:00:38.6784068Z Uninstalling nvidia-ml-py-13.590.48: 2026-02-21T14:00:38.7615284Z Successfully uninstalled nvidia-ml-py-13.590.48 2026-02-21T14:00:38.7889776Z Attempting uninstall: psutil 2026-02-21T14:00:38.7906240Z Found existing installation: psutil 7.2.2 2026-02-21T14:00:38.7918843Z Uninstalling psutil-7.2.2: 2026-02-21T14:00:38.7922305Z Successfully uninstalled psutil-7.2.2 2026-02-21T14:00:38.8773520Z 2026-02-21T14:00:38.8778315Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2026-02-21T14:00:38.8779568Z tritonbench 0.0.1 requires triton, which is not installed. 
2026-02-21T14:00:38.8797703Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0 2026-02-21T14:00:38.9777953Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py 2026-02-21T14:00:47.7336037Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main 2026-02-21T14:00:47.7336225Z with: 2026-02-21T14:00:47.7336310Z venv: .venv/bin/activate 2026-02-21T14:00:47.7336407Z env: 2026-02-21T14:00:47.7336489Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:47.7336618Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.7336775Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:47.7336927Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.7337062Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.7337201Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.7337342Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:47.7337496Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:47.7337631Z DEVICE_NAME: rocm 2026-02-21T14:00:47.7337727Z DEVICE_TYPE: AMD Instinct MI325X 2026-02-21T14:00:47.7337832Z ##[endgroup] 2026-02-21T14:00:47.7344415Z ##[group]Run set -eux 2026-02-21T14:00:47.7344527Z set -eux 2026-02-21T14:00:47.7344616Z  2026-02-21T14:00:47.7344715Z # TODO (huydhn): Implement this part 2026-02-21T14:00:47.7344857Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2026-02-21T14:00:47.7345065Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T14:00:47.7345185Z env: 2026-02-21T14:00:47.7345272Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:47.7345396Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.7345549Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:47.7345707Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.7345839Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 
2026-02-21T14:00:47.7345975Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.7346112Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:47.7346272Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:47.7346408Z DEVICE_NAME: rocm 2026-02-21T14:00:47.7346508Z DEVICE_TYPE: AMD Instinct MI325X 2026-02-21T14:00:47.7346618Z ##[endgroup] 2026-02-21T14:00:47.8097460Z + echo 'dependencies={}' 2026-02-21T14:00:47.8157351Z ##[group]Run actions/upload-artifact@v6 2026-02-21T14:00:47.8157487Z with: 2026-02-21T14:00:47.8157611Z name: benchmark-results-mi325x-int4_gemm,flash_attention 2026-02-21T14:00:47.8157759Z path: test/test-reports 2026-02-21T14:00:47.8157861Z if-no-files-found: warn 2026-02-21T14:00:47.8157965Z compression-level: 6 2026-02-21T14:00:47.8158064Z overwrite: false 2026-02-21T14:00:47.8158158Z include-hidden-files: false 2026-02-21T14:00:47.8158256Z env: 2026-02-21T14:00:47.8158339Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:00:47.8158463Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.8158618Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:00:47.8158773Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.8158906Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.8159171Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:00:47.8159319Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:00:47.8159477Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:00:47.8159614Z DEVICE_NAME: rocm 2026-02-21T14:00:47.8159705Z DEVICE_TYPE: AMD Instinct MI325X 2026-02-21T14:00:47.8159813Z ##[endgroup] 2026-02-21T14:00:47.8161576Z ##[command]/usr/bin/docker exec 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T14:00:47.9673928Z With the provided path, there will be 1 file uploaded 2026-02-21T14:00:47.9676122Z Artifact name is 
valid! 2026-02-21T14:00:47.9676253Z Root directory input is valid! 2026-02-21T14:00:53.2440670Z Beginning upload of artifact content to blob storage 2026-02-21T14:00:55.6810262Z Uploaded bytes 586 2026-02-21T14:00:55.7160646Z Finished uploading artifact content to blob storage! 2026-02-21T14:00:55.7161372Z SHA256 digest of uploaded artifact zip is 42ed98c5c494960b63600f8bc3f969c4bb6eaafe9f6a44cc13b7347ea28c1d8c 2026-02-21T14:00:55.7162035Z Finalizing artifact upload 2026-02-21T14:00:55.8682943Z Artifact benchmark-results-mi325x-int4_gemm,flash_attention.zip successfully finalized. Artifact ID 5601585868 2026-02-21T14:00:55.8684036Z Artifact benchmark-results-mi325x-int4_gemm,flash_attention has been successfully uploaded! Final size is 586 bytes. Artifact ID is 5601585868 2026-02-21T14:00:55.8685091Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5601585868 2026-02-21T14:00:55.8852819Z Post job cleanup. 2026-02-21T14:00:55.8855628Z ##[command]/usr/bin/docker exec 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T14:00:56.0269098Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python 2026-02-21T14:00:56.0270751Z (node:871989) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T14:00:56.0271215Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T14:00:56.0406283Z Post job cleanup. 2026-02-21T14:00:56.0408320Z ##[command]/usr/bin/docker exec 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T14:00:56.2314033Z Post job cleanup. 
2026-02-21T14:00:56.2316190Z ##[command]/usr/bin/docker exec 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T14:00:56.3761231Z [command]/usr/bin/git version 2026-02-21T14:00:56.3789584Z git version 2.43.0 2026-02-21T14:00:56.3821747Z Temporarily overriding HOME='/__w/_temp/87c4a125-c580-4df0-9e3f-e3635800cd14' before making global git config changes 2026-02-21T14:00:56.3822122Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T14:00:56.3824325Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T14:00:56.3843318Z Removing SSH command configuration 2026-02-21T14:00:56.3846121Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T14:00:56.3862728Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T14:00:56.4071493Z Removing HTTP extra header 2026-02-21T14:00:56.4073650Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T14:00:56.4093109Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T14:00:56.4244860Z Removing includeIf entries pointing to credentials config files 2026-02-21T14:00:56.4247605Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T14:00:56.4265787Z includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T14:00:56.4266085Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T14:00:56.4266349Z includeif.gitdir:/github/workspace/.git.path 2026-02-21T14:00:56.4266603Z includeif.gitdir:/github/workspace/.git/worktrees/*.path 
2026-02-21T14:00:56.4270388Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T14:00:56.4288882Z /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4302310Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4321481Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T14:00:56.4333159Z /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4337706Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4351714Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path 2026-02-21T14:00:56.4365196Z /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4372045Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4387985Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T14:00:56.4400037Z /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4407018Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config 2026-02-21T14:00:56.4423851Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T14:00:56.4572829Z Removing credentials config 
'/__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config' 2026-02-21T14:00:56.4691996Z Stop and remove container: c6efec98bce142318ef5a57c93d78ef5_rocmdevubuntu2404644complete_87ce82 2026-02-21T14:00:56.4694556Z ##[command]/usr/bin/docker rm --force 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 2026-02-21T14:00:58.4963592Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 2026-02-21T14:00:58.4994906Z Remove container network: github_network_2fc8484199d94410beae10a38ccd998e 2026-02-21T14:00:58.4997175Z ##[command]/usr/bin/docker network rm github_network_2fc8484199d94410beae10a38ccd998e 2026-02-21T14:00:59.0057460Z github_network_2fc8484199d94410beae10a38ccd998e 2026-02-21T14:00:59.0117045Z Evaluate and set job outputs 2026-02-21T14:00:59.0121147Z Set output 'benchmark-metadata' 2026-02-21T14:00:59.0122225Z Set output 'runners-info' 2026-02-21T14:00:59.0122491Z Set output 'dependencies' 2026-02-21T14:00:59.0122817Z Cleaning up orphan processes